Abstract

Working under a model of privacy in which data remains private even from the statistician,
we study the tradeoff between privacy guarantees and the utility of the resulting statistical
estimators. We prove bounds on information-theoretic quantities, including mutual information
and Kullback-Leibler divergence, that influence estimation rates as a function of the amount of
privacy preserved. When combined with standard minimax techniques such as Le Cam's and
Fano's methods, these inequalities allow for a precise characterization of statistical rates under
local privacy constraints. In this paper, we provide a complete treatment of three canonical
problem families: mean estimation in location family models, parameter estimation in fixed-design
regression, and convex risk minimization. For all of these families, we provide lower and
upper bounds that match up to constant factors, giving privacy-preserving mechanisms and
computationally efficient estimators that achieve the bounds.



1 Introduction

A major challenge in statistical inference is that of characterizing and controlling the balance
between statistical efficiency and the privacy of individuals from whom data is obtained [13, 14].
Such a characterization requires a formal definition of privacy. In recent years, the notion of
differential privacy has been put forth as one formal definition of privacy (e.g., [23, 24, 9, 27, 31]).
In the database and cryptography literatures from which differential privacy arose,
the focus has been algorithmic; in particular, researchers have used differential privacy to evaluate
privacy-retaining mechanisms for transporting, indexing, and querying data. More recent work
aims to link differential privacy to statistical objectives (e.g., [13]); still, the focus in the
bulk of this work has been on specific mechanisms for achieving differential privacy.

In this paper, we take a more abstract approach to studying the interplay between inference and 
privacy, one in which differential privacy acts as a constraint on a data analysis, but the analysis 
remains agnostic to the particular privacy-enforcing mechanism. We do so by working within a 
statistical decision-theoretic framework, and studying the minimax risks associated with various 
estimation problems under abstract differential privacy constraints. This minimax framework allows 
us to obtain fundamental bounds that hold uniformly for classes of inferential procedures regardless 
of the particular mechanisms used to achieve differential privacy. Having obtained lower bounds 
on risk that incorporate differential privacy, we also provide matching upper bounds via specific 
algorithms. The overall goal is that of bringing differential privacy into close contact with the
foundational concepts of statistical decision theory, as well as to provide quantitative tradeoffs that
can inform practice.

In line with our focus on fundamental limits, we study the strong setting of local privacy, where
data providers trust no one, not even the statistician collecting the data. Local privacy is one
of the oldest forms of privacy, and its essential form dates back to Warner [35], who proposed
it as a remedy for what he termed "evasive answer bias" in survey sampling. More formally, let
$X_1, \ldots, X_n \in \mathcal{X}$ be samples drawn according to some distribution $P$. We consider
procedures for estimating a parameter $\theta = \theta(P)$ of the unknown distribution that have
access only to obscured views, $Z_1, \ldots, Z_n \in \mathcal{Z}$, of the original data. The original
$\{X_i\}_{i=1}^n$ and the privatized $\{Z_i\}_{i=1}^n$ random variables are linked via a consistent
family of conditional distributions $Q_i(Z_i \mid X_i = x, Z_j = z_j, j \neq i)$.
To simplify notation, we typically omit the subscript in $Q_i$, as it is clear from the context.^1 Since
it acts as a conduit from the original to the privatized data, we refer to $Q$ as a channel
distribution. Note that the dependence of the channel distribution on all of the obscured data points
allows us to highlight one of the advantages of the differential privacy framework, in particular its
robustness to "interactivity": data release mechanisms may change depending on what has
been released [17]. Such robustness, together with the treatment of issues of side information or
adversarial strength that are problematic for other formalisms, has been used to make the case for
differential privacy within the computer science literature; see, for example, the paper [17].



Although differential privacy provides an elegant formalism for limiting disclosure and protecting
against many forms of privacy breach, it is a stringent measure of privacy, and it is conceivably
overly stringent for statistical practice. Indeed, Fienberg et al. [21] criticize the use of differential
privacy in releasing contingency tables, arguing that known mechanisms for differentially private
data release can give unacceptably poor performance. As a consequence, they advocate, in some
cases, recourse to weaker privacy guarantees to maintain the utility and usability of released data.
There are, however, results that are more favorable for differential privacy; for example, Smith
shows that in some parametric problems, the non-local form of differential privacy can be
satisfied while yielding asymptotically optimal parametric rates of convergence for different point 
estimators. Hall et al. [23] also show minimax rates for histogram release in differentially private 
settings, giving a relaxed version of privacy to attain better convergence guarantees, and Chaudhuri 
and Hsu [3] give lower bounds for certain one dimensional statistics based on a two-point family. 
Resolving such differing perspectives requires investigation into whether particular methods have 
optimality properties that would allow a general criticism of the framework, and characterizing the 
trade-offs between privacy and statistical efficiency. Such are the goals of the current paper. 

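Our work is based on the following general definition of local differential privacy. For a given
privacy parameter $\alpha \ge 0$, we say that $Z_i$ is an $\alpha$-differentially locally private view of $X_i$ if

$$\sup \left\{ \frac{Q_i(S \mid X_i = x, Z_j = z_j, j \neq i)}{Q_i(S \mid X_i = x', Z_j = z_j, j \neq i)} \;\middle|\; S \in \sigma(\mathcal{Z}),\ z_j \in \mathcal{Z},\ \text{and } x, x' \in \mathcal{X} \right\} \le \exp(\alpha), \tag{1}$$

where $\sigma(\mathcal{Z})$ denotes an appropriate $\sigma$-field on $\mathcal{Z}$. We also consider a simplification, appropriate
for non-interactive protocols, where $Z_i$ is generated based only on $X_i$: the bound (1) reduces to

$$\sup_{S \in \sigma(\mathcal{Z})} \ \sup_{x, x' \in \mathcal{X}} \frac{Q(S \mid X_i = x)}{Q(S \mid X_i = x')} \le \exp(\alpha). \tag{2}$$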



^1 Formally, we define the full conditional distribution $Q(Z_1, \ldots, Z_n \mid X_1, \ldots, X_n)$, where $Z_i$ is conditionally
independent of $X_j$ given $Z_j$, $j \neq i$, and $X_i$, over which we may integrate to derive the consistent family of conditionals
$Q_i$. We write the full conditioning simply to indicate that $Z_i$ may depend on $Z_j$ in some settings.

Both of these definitions capture a type of plausible deniability: no matter what data $Z$ is released,
it is nearly equally as likely to have come from any point $x \in \mathcal{X}$ as any other. It is also possible
to interpret differential privacy within a hypothesis testing framework, where $\alpha$ controls the error
rate in tests for the presence or absence of individual data points in a dataset [36].
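
As a concrete illustration of definition (2), the following minimal sketch checks that binary randomized response (in the spirit of Warner's survey technique) is $\alpha$-locally private. The keep-probability $e^{\alpha}/(1 + e^{\alpha})$ and the function names are assumptions made for this example, not constructions taken from the paper.

```python
import math

def randomized_response_channel(alpha):
    """Return Q as nested dicts: Q[x][z] = P(Z = z | X = x) for binary x, z.

    Assumption for this sketch: the true bit is reported with probability
    e^alpha / (1 + e^alpha) and flipped otherwise.
    """
    keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return {0: {0: keep, 1: 1.0 - keep},
            1: {0: 1.0 - keep, 1: keep}}

def max_likelihood_ratio(channel):
    """Largest ratio Q(z | x) / Q(z | x') over outputs z and inputs x, x'.

    For a finite output space the supremum over sets S in (2) is attained
    on a single output, so checking singletons is enough.
    """
    return max(channel[x][z] / channel[xp][z]
               for z in (0, 1) for x in (0, 1) for xp in (0, 1))

alpha = 0.5
Q = randomized_response_channel(alpha)
ratio = max_likelihood_ratio(Q)
print(ratio, math.exp(alpha))  # the two values should agree up to rounding
```

The ratio equals $\exp(\alpha)$ exactly for this channel, so the constraint (2) holds with equality.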



1.1 Our contributions 

The main contribution of this work is to provide general techniques for deriving minimax bounds 
under local privacy constraints, and to illustrate the use of these techniques to compute the min- 
imax rates for three canonical problems: (a) mean estimation in location families; (b) parameter 
estimation in fixed design regression; and (c) convex risk minimization. 

Many standard methods for obtaining minimax bounds involve information-theoretic quantities,
including the mutual information between certain random variables and the Kullback-Leibler (KL)
divergence between different distributions that may have generated the data (see, e.g., [38, 37]).
In particular, let $P_1$ and $P_2$ denote two possible distributions that might have generated the data
$X_i$, and for $\nu \in \{1, 2\}$, define the marginal distribution $M_\nu^n$ to be the distribution on $\mathcal{Z}^n$ given by

$$M_\nu^n(A) := \int Q^n(A \mid x_1, \ldots, x_n)\, dP_\nu(x_1, \ldots, x_n) \quad \text{for } A \in \sigma(\mathcal{Z}^n). \tag{3}$$

Here $Q^n(\cdot \mid x_1, \ldots, x_n)$ denotes the joint distribution on $\mathcal{Z}^n$ of the $n$ samples $Z_{1:n}$, conditioned
on the initial data $X_{1:n} = x_{1:n}$, based on the protocol for communication that the inference algorithm
and data providers use. The mutual information of samples drawn according to distributions
of the form (3) and the KL divergence between such distributions are key objects in statistical
discriminability and minimax rates [38, 37].

Keeping in mind the centrality of these information-theoretic quantities, our main results can
be summarized at a high level as follows. Theorem 1 provides a general result that bounds the
KL divergence between distributions $M_1^n$ and $M_2^n$, as defined by the marginal (3), by a quantity
dependent on the differential privacy parameter $\alpha$ and the total variation distance between $P_1$ and
$P_2$, the initial distributions of the $X_i$. The essence of Theorem 1 is that

$$D_{\mathrm{kl}}(M_1^n \| M_2^n) \lesssim \alpha^2 n \, \|P_1 - P_2\|_{\mathrm{TV}}^2,$$

where $\lesssim$ denotes inequality up to constant factors. When $\alpha^2 < 1$, which is the usual region of
interest, this result shows that for statistical procedures whose minimax rate of convergence can
be determined by classical information-theoretic methods, the additional requirement of $\alpha$-local
differential privacy causes the effective sample size of any statistical procedure to be reduced from
$n$ to $\alpha^2 n$. Section 3.1 contains the formal statement of this theorem, while Section 3.2 provides
corollaries that show its use in application to minimax risk bounds. We follow this in Section 3.3
with applications of these results to estimation in location family models and fixed-design regression
problems, providing corresponding upper bounds on the minimax risk. In accord with our general
analysis, we see the reduction of effective sample size from $n$ to $\alpha^2 n$, but we also exhibit some
striking difficulties of locally differentially private estimation in non-compact spaces. Indeed, if we
wish to estimate the mean of a random variable $X$ satisfying $\mathrm{Var}(X) \le 1$, the minimax rate of
estimation of $E[X]$ decreases from the parametric $1/n$ rate to $1/\sqrt{n\alpha^2}$, which is quite substantial.

Theorem 1 is appropriate for many problems in which only single-dimensional quantities are
kept private, but does not address difficulties inherent in higher-dimensional problems. With this
motivation, our second main result (Theorem 2) is a more powerful result that incorporates
dimensionality in an essential way. At a high level, it provides a general variational upper bound on
information-theoretic quantities necessary for proving lower bounds, and we give a brief sketch of
its applications here. Given multiple distributions $M_\nu^n$ of the form (3), where $\nu$ ranges over some
large set $\mathcal{V}$ indexing a set of possible distributions on the data $X$, we define the mean distribution
$\overline{M}^n = \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} M_\nu^n$. Controlling the average deviation $D_{\mathrm{kl}}(M_\nu^n \| \overline{M}^n)$ over $\mathcal{V}$ is essential in
information-theoretic techniques such as Fano's method [38, 37] for proving minimax lower bounds.
Theorem 2 allows us to relate the covariance structure of the elements $\nu$ to this average KL
divergence. As a consequence, with appropriate choice of the set $\mathcal{V}$, we obtain that for some
$d$-dimensional statistical problems the effective sample size is reduced from $n$ to $n\alpha^2/d$, which is
substantial. We provide the main statement and consequences of Theorem 2 in Section 4, and in
Section 5 we present its application to obtaining minimax rates for private convex risk minimization
problems.



Notation: We briefly summarize our notation here. For distributions $P$ and $Q$ defined on a space
$\mathcal{X}$, each absolutely continuous with respect to a distribution $\mu$ (with corresponding densities $p$ and
$q$), the KL divergence between $P$ and $Q$ is defined by

$$D_{\mathrm{kl}}(P \| Q) := \int_{\mathcal{X}} dP \log \frac{dP}{dQ} = \int_{\mathcal{X}} p \log \frac{p}{q} \, d\mu.$$

Letting $\sigma(\mathcal{X})$ denote an appropriate $\sigma$-field on $\mathcal{X}$, the total variation distance between the
distributions $P$ and $Q$ is given by

$$\|P - Q\|_{\mathrm{TV}} := \sup_{S \in \sigma(\mathcal{X})} |P(S) - Q(S)| = \frac{1}{2} \int_{\mathcal{X}} |p(x) - q(x)| \, d\mu(x).$$

For random vectors $X$ and $Y$, let $Q(\cdot \mid X)$ denote the distribution of $Y$ conditional on $X$. The
mutual information between $X$ and $Y$ is defined as

$$I(X; Y) := E_P\left[D_{\mathrm{kl}}(Q(\cdot \mid X) \| M(\cdot))\right] = \int D_{\mathrm{kl}}(Q(\cdot \mid X = x) \| M(\cdot)) \, dP(x),$$

where $P$ and $M$ are (respectively) the marginal distributions of $X$ and $Y$. A random variable $Y$
has Laplace($\alpha$) distribution if the density of $Y$ is $p_Y(y) = \frac{\alpha}{2} \exp(-\alpha |y|)$, where $\alpha > 0$. For
matrices $A, B \in \mathbb{R}^{d \times d}$, we use the notation $A \preceq B$ to mean that $B - A$ is positive semidefinite,
and $A \prec B$ to mean that $B - A$ is positive definite. For two real sequences $\{a_n\}$ and $\{b_n\}$, we use
$a_n \lesssim b_n$ to mean that there is a constant $C < \infty$ such that $a_n \le C b_n$ for all $n$, and $a_n \asymp b_n$ to
denote that $a_n \lesssim b_n$ and $b_n \lesssim a_n$. For a convex function $f : \mathbb{R}^d \to \mathbb{R}$, we use $\partial f(\theta)$ to denote its
sub-differential at $\theta$, namely the set

$$\partial f(\theta) := \left\{ g \in \mathbb{R}^d \mid f(\theta') \ge f(\theta) + \langle g, \theta' - \theta \rangle \ \text{for all } \theta' \in \mathbb{R}^d \right\}.$$



2 Background and problem formulation 



We begin by setting up the minimax framework used throughout this paper; see references [37, 38, 33]
for further background. Let $\mathcal{P}$ denote a class of distributions on the sample space $\mathcal{X}$, and let
$\theta(P) \in \Theta$ denote a function defined on $\mathcal{P}$. The space $\Theta$ in which the parameter $\theta(P)$ takes values
depends on the underlying statistical model (e.g., for univariate mean estimation, it is a subset of
the real line). Let $\rho$ denote a semi-metric on the space $\Theta$, which we use to measure the error of an
estimator for the parameter $\theta$, and we let $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ be a non-decreasing function with $\Phi(0) = 0$
(for example, $\Phi(t) = t^2$).

In the classical setting, the statistician is given direct access to i.i.d. samples $X_i$ drawn according
to some $P \in \mathcal{P}$. The local privacy setting involves an additional ingredient, namely a conditional
distribution $Q$ that transforms the samples $X_i$ to the private samples $Z_i$ taking values in $\mathcal{Z}$. Based
on the observations $(Z_1, \ldots, Z_n)$, our goal is to estimate the unknown parameter $\theta(P) \in \Theta$. An
estimator $\widehat{\theta}$ is a measurable function $\widehat{\theta} : \mathcal{Z}^n \to \Theta$, and we assess the quality of the estimate
$\widehat{\theta}(Z_1, \ldots, Z_n)$ in terms of the quantity

$$E_{P,Q}\left[\Phi\left(\rho(\widehat{\theta}(Z_1, \ldots, Z_n), \theta(P))\right)\right].$$

For instance, for a univariate mean problem with $\rho(\theta, \theta') = |\theta - \theta'|$ and $\Phi(t) = t^2$, this error metric
reduces to the mean-squared error. For any fixed conditional distribution $Q$, we can define the
minimax rate

$$\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, Q) := \inf_{\widehat{\theta}} \sup_{P \in \mathcal{P}} E_{P,Q}\left[\Phi\left(\rho(\widehat{\theta}(Z_1, \ldots, Z_n), \theta(P))\right)\right], \tag{4}$$

where we take the supremum (worst-case) over all distributions $P \in \mathcal{P}$, and the infimum is taken
over all estimators $\widehat{\theta}$. For each $\alpha \ge 0$, we can also define the set $\mathcal{Q}_\alpha$ to consist of all conditional
distributions guaranteeing $\alpha$-local privacy (1). By minimizing over all $Q \in \mathcal{Q}_\alpha$, we obtain what we
refer to as the $\alpha$-minimax rate for the family $\theta(\mathcal{P})$,

$$\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, \alpha) := \inf_{Q \in \mathcal{Q}_\alpha} \mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, Q) = \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\widehat{\theta}} \sup_{P \in \mathcal{P}} E_{P,Q}\left[\Phi\left(\rho(\widehat{\theta}(Z_1, \ldots, Z_n), \theta(P))\right)\right]. \tag{5}$$

This quantity is the central object of study in this paper: it characterizes the optimal rate of
statistical estimation in terms of the privacy parameter $\alpha$, in a uniform sense over the family $\theta(\mathcal{P})$,
using the best possible estimator $\widehat{\theta}$ and $\alpha$-locally private conditional distribution $Q$.



2.1 From estimation to testing 

A standard first step in proving minimax bounds is to reduce an estimation problem to a testing
problem. More precisely, given an index set $\mathcal{V}$ of finite cardinality, consider a family of distributions
$\{P_\nu, \nu \in \mathcal{V}\}$ contained within $\mathcal{P}$. This family induces a collection of parameters $\{\theta(P_\nu), \nu \in \mathcal{V}\}$,
which is said to be a $2\delta$-packing in the $\rho$-semimetric if

$$\rho(\theta(P_\nu), \theta(P_{\nu'})) \ge 2\delta \quad \text{for all } \nu \neq \nu'. \tag{6}$$

We use this family to define the canonical hypothesis testing problem: suppose that nature chooses
a random variable $V \in \mathcal{V}$ uniformly at random, and that, conditioned on the choice $V = \nu$, the
random vector $X = (X_1, \ldots, X_n)$ is drawn from the $n$-fold product distribution $P_\nu^n$. In the classical
setting, the random vector $X$ is observed directly by the statistician. The additional twist provided
by a local privacy constraint is that, for a given conditional distribution $Q$, we generate a new
random vector $Z = (Z_1, \ldots, Z_n)$ by sampling each $Z_i$ from the distribution $Q(\cdot \mid X_1, \ldots, X_n)$ (in
many cases, this sampling is conditionally i.i.d., so we sample $Z_i$ according to $Q(\cdot \mid X_i)$). By
construction, conditioned on the choice $V = \nu$, the random vector $Z$ is distributed according to the
marginal measure $M_\nu^n$ defined in equation (3).

Given the observed vector, the goal is to determine the value of the underlying index $\nu$. A testing
function is a measurable mapping $\psi : \mathcal{Z}^n \to \mathcal{V}$, and its error probability is $\mathbb{P}(\psi(Z_1, \ldots, Z_n) \neq V)$,
where $\mathbb{P}$ denotes the joint distribution over the random index $V$ and $Z$. The classical reduction
from estimation to testing guarantees that, for any non-decreasing function $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$, the
minimax error previously defined (4) is lower bounded as

$$\mathfrak{M}_n(\Theta, \Phi \circ \rho, Q) \ge \Phi(\delta) \, \inf_{\psi} \mathbb{P}(\psi(Z_1, \ldots, Z_n) \neq V), \tag{7}$$

where the infimum ranges over all testing functions.

Following this reduction, the remaining challenge is to lower bound the probability of error in
the underlying multi-way hypothesis testing problem. There are a variety of techniques for this,
and we focus on two powerful bounds on the probability (7) of error, due to Le Cam and Fano. Le
Cam's inequality (see, e.g., Yu [38, Lemma 1] or Tsybakov [33, Theorem 2.2]) is applicable when
there are only two values $\nu, \nu'$ in $\mathcal{V}$. In this case, one has the bound

$$\inf_{\psi} \mathbb{P}(\psi(Z_1, \ldots, Z_n) \neq V) \ge \frac{1}{2} - \frac{1}{2} \left\| M_\nu^n - M_{\nu'}^n \right\|_{\mathrm{TV}}, \tag{8}$$

where the marginal $M$ is defined as in the expression (3). More generally, Fano's inequality
(e.g., Yang and Barron [37, equation (1)] or Gray, Lemma 4.2.1) holds when nature chooses
randomly from a set $\mathcal{V}$ of cardinality larger than 2, and is

$$\inf_{\psi} \mathbb{P}(\psi(Z_1, \ldots, Z_n) \neq V) \ge 1 - \frac{I(Z_1, \ldots, Z_n; V) + \log 2}{\log |\mathcal{V}|}. \tag{9}$$

As a consequence of the inequalities (8) and (9) bounding the probability of error in the testing
problem, our main theoretical results focus on controlling the total variation distance $\|M_1^n - M_2^n\|_{\mathrm{TV}}$
or the mutual information between the random parameter index $V$ and the sequence of random
variables $Z_1, \ldots, Z_n$. This control allows us to prove sharp lower bounds on the minimax risk (5).



3 Pairwise upper bounds under local privacy 

We begin with a relatively simple upper bound on the symmetrized Kullback-Leibler divergence 
under a local privacy constraint. We then develop some consequences of this result for both Le 
Cam's method and a local form of Fano's method. Using these methods, we derive sharp minimax 
rates under local privacy for estimating means in location families, as well as for fixed design 
regression. 



3.1 Pairwise upper bounds on Kullback-Leibler divergences 

Many statistical problems depend on comparisons between a pair of distributions $P_1$ and $P_2$ defined
on a common space $\mathcal{X}$. Any conditional distribution $Q$ transforms such a pair of distributions into
a new pair $(M_1, M_2)$ via marginalization

$$M_j(A) := \int_{\mathcal{X}} Q(A \mid x) \, dP_j(x) \quad \text{for } j = 1, 2. \tag{10}$$

Our first main result bounds the (symmetrized) KL divergence between these two induced marginals
as a function of the privacy parameter $\alpha \ge 0$ associated with the conditional distribution $Q$ and
the total variation distance between $P_1$ and $P_2$.

Theorem 1. Let $Q$ be any conditional distribution that provides $x$ with $\alpha$-differential privacy. Then
for any two distributions $P_1$ and $P_2$ on $\mathcal{X}$, the induced marginals $M_1$ and $M_2$ satisfy the bound

$$D_{\mathrm{kl}}(M_1 \| M_2) + D_{\mathrm{kl}}(M_2 \| M_1) \le 4(e^{\alpha} - 1)^2 \, \|P_1 - P_2\|_{\mathrm{TV}}^2. \tag{11}$$

Remark: Note that for $\alpha \le \frac{23}{35}$ we have the inequality $e^{\alpha} - 1 \le \sqrt{2}\,\alpha$. Consequently, by applying
Pinsker's inequality to the total variation distance between $P_1$ and $P_2$, Theorem 1 implies that

$$D_{\mathrm{kl}}(M_1 \| M_2) + D_{\mathrm{kl}}(M_2 \| M_1) \le 8\alpha^2 \, \|P_1 - P_2\|_{\mathrm{TV}}^2 \le 4\alpha^2 \min\left\{ D_{\mathrm{kl}}(P_1 \| P_2), D_{\mathrm{kl}}(P_2 \| P_1) \right\} \tag{12}$$

for $\alpha \in [0, \frac{23}{35}]$. This inequality allows us to relate the symmetrized KL divergence between $M_1$ and
$M_2$ directly to the KL divergences between $P_1$ and $P_2$. We can also use Pinsker's inequality to see

$$\|M_1 - M_2\|_{\mathrm{TV}}^2 \le 4\alpha^2 \, \|P_1 - P_2\|_{\mathrm{TV}}^2 \tag{13}$$

for $\alpha \in [0, \frac{23}{35}]$, which allows us to relate the total variation distances directly.
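
To see the bound (11) at work numerically, the sketch below pushes two Bernoulli distributions through a binary randomized-response channel and compares both sides of the inequality. The channel, the helper names, and the particular distributions are illustrative assumptions; the check is a sanity test on one example, not a proof.

```python
import math

def kl(p, q):
    """KL divergence between two discrete pmfs given as tuples."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance between two discrete pmfs."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def marginal(p, alpha):
    """Marginal of Z when X ~ p = (P(X=0), P(X=1)) passes through
    randomized response with keep-probability e^alpha / (1 + e^alpha)."""
    keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    m0 = keep * p[0] + (1.0 - keep) * p[1]
    return (m0, 1.0 - m0)

def theorem1_sides(p1, p2, alpha):
    """Return (left side, right side) of inequality (11)."""
    m1, m2 = marginal(p1, alpha), marginal(p2, alpha)
    lhs = kl(m1, m2) + kl(m2, m1)
    rhs = 4.0 * (math.exp(alpha) - 1.0) ** 2 * tv(p1, p2) ** 2
    return lhs, rhs

p1, p2 = (0.3, 0.7), (0.6, 0.4)
for alpha in (0.1, 0.5, 23 / 35):
    lhs, rhs = theorem1_sides(p1, p2, alpha)
    print(f"alpha={alpha:.3f}  symmetrized KL={lhs:.5f}  bound={rhs:.5f}")

# The threshold in the remark: e^alpha - 1 <= sqrt(2) * alpha at alpha = 23/35.
print(math.exp(23 / 35) - 1.0, math.sqrt(2.0) * 23 / 35)
```

In this example the symmetrized KL divergence sits well below the bound, and the last line shows that $\alpha = 23/35$ is very close to the largest value for which $e^{\alpha} - 1 \le \sqrt{2}\,\alpha$ holds.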

We provide the proof of Theorem 1 in Section 6. Here we develop a corollary that has useful
consequences for minimax theory under local privacy constraints. Suppose that conditionally on
$V = \nu$, we form a random vector $X = (X_1, \ldots, X_n)$ by drawing each $X_i$ independently from a
distribution $P_{\nu,i}$. Given the $\alpha$-locally private conditional distribution $Q$ (recall definition (1)), form
the random vector $Z = (Z_1, \ldots, Z_n)$ by sampling $Z_i$ from $Q(\cdot \mid X_{1:n})$. Conditioned on $V = \nu$,
the random vector $Z$ is distributed according to the measure $M_\nu^n$ as defined earlier (3). Note that
because we allow interactive protocols, this is not necessarily a product distribution, even though
we enforce $\alpha$-local privacy.

Corollary 1. For any conditional distribution $Q$ that guarantees $\alpha$-local differential privacy and
any pair of distributions $P_{\nu,i}$ and $P_{\nu',i}$, we have

$$D_{\mathrm{kl}}(M_\nu^n \| M_{\nu'}^n) + D_{\mathrm{kl}}(M_{\nu'}^n \| M_\nu^n) \le 4(e^{\alpha} - 1)^2 \sum_{i=1}^n \left\| P_{\nu,i} - P_{\nu',i} \right\|_{\mathrm{TV}}^2. \tag{14}$$

Moreover, for $V$ uniformly distributed over the index set $\mathcal{V}$, we have

$$I(Z_1, \ldots, Z_n; V) \le 2(e^{\alpha} - 1)^2 \sum_{i=1}^n \frac{1}{|\mathcal{V}|^2} \sum_{\nu, \nu' \in \mathcal{V}} \left\| P_{\nu,i} - P_{\nu',i} \right\|_{\mathrm{TV}}^2. \tag{15}$$

See Section 6.2 for the proof, which requires a few intermediate steps to obtain the additive
inequality. The bound (15) follows directly from the inequality (14). In particular, if we define the
mean distribution $\overline{M}^n = \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} M_\nu^n$, then by the definition of mutual information, we have

$$I(Z_1, \ldots, Z_n; V) = \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} D_{\mathrm{kl}}(M_\nu^n \| \overline{M}^n).$$

The convexity of the KL divergence implies that $D_{\mathrm{kl}}(M_\nu^n \| \overline{M}^n) \le \frac{1}{|\mathcal{V}|} \sum_{\nu' \in \mathcal{V}} D_{\mathrm{kl}}(M_\nu^n \| M_{\nu'}^n)$,
and applying (14) to the pairwise terms yields the claim (15).

3.2 Consequences for minimax theory under local privacy constraints

We now turn to some consequences of Theorem 1 for minimax theory under local privacy constraints.
For ease of presentation, we assume a fully i.i.d. sampling model, i.e., $P_{\nu,i} = P_\nu$ for $i = 1, \ldots, n$.
(All of our results generalize naturally to the independent but non-i.i.d. setting.) We show that in
both Le Cam's inequality and the local version of Fano's method, the price of $\alpha$-local differential
privacy is a reduction in the effective sample size from $n$ to $4\alpha^2 n$.

Consequence for Le Cam's method: Our theory has an immediate consequence for Le Cam's
method, which yields a lower bound on the minimax error in terms of a binary hypothesis test.
The classical (non-private) version of Le Cam's method applies to the usual minimax risk

$$\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho) := \inf_{\widehat{\theta}} \sup_{P \in \mathcal{P}} E_P\left[\Phi\left(\rho(\widehat{\theta}(X_1, \ldots, X_n), \theta(P))\right)\right]$$

for estimators $\widehat{\theta}$ that are functions of $X_1, \ldots, X_n$. One version of Le Cam's lemma (8) asserts that,
for any pair of distributions $\{P_1, P_2\}$ such that $\rho(\theta(P_1), \theta(P_2)) \ge 2\delta$, we have

$$\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho) \ge \Phi(\delta) \left\{ \frac{1}{2} - \frac{1}{2\sqrt{2}} \sqrt{n D_{\mathrm{kl}}(P_1 \| P_2)} \right\}. \tag{16}$$

Now let us return to the $\alpha$-locally private setting, in which the estimator $\widehat{\theta}$ must depend only
on the private variables $(Z_1, \ldots, Z_n)$, and we measure the $\alpha$-private minimax risk (5). By applying
Le Cam's method to the pair $(M_1, M_2)$ along with Theorem 1 in the form of inequality (13), we
find

$$\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, \alpha) \ge \Phi(\delta) \left\{ \frac{1}{2} - \frac{1}{2\sqrt{2}} \sqrt{4 n \alpha^2 D_{\mathrm{kl}}(P_1 \| P_2)} \right\} \quad \text{for } \alpha \in [0, \tfrac{23}{35}]. \tag{17}$$

By comparison with the original Le Cam bound (16), we see that for $\alpha \in [0, \frac{23}{35}]$, the effect of $\alpha$-local
differential privacy is to reduce the effective sample size from $n$ to $4\alpha^2 n$. We illustrate the use of
this $\alpha$-private version of Le Cam's bound in our analysis of the location family problem to follow.
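
The comparison between (16) and (17) can be made concrete with a small calculator. The sketch below simply evaluates the two right-hand sides; the function names and the example values are hypothetical, and a negative return value just means the bound is vacuous for that choice of inputs.

```python
import math

ALPHA_MAX = 23 / 35  # range of alpha for which (17) is stated

def le_cam_classical(phi_delta, n, kl_p1_p2):
    """Right-hand side of the classical Le Cam bound (16)."""
    return phi_delta * (0.5 - math.sqrt(n * kl_p1_p2) / (2.0 * math.sqrt(2.0)))

def le_cam_private(phi_delta, n, alpha, kl_p1_p2):
    """Right-hand side of the alpha-private bound (17).

    This is (16) with n replaced by the effective sample size 4 * alpha^2 * n.
    """
    if not 0.0 <= alpha <= ALPHA_MAX:
        raise ValueError("bound (17) is stated for alpha in [0, 23/35]")
    return le_cam_classical(phi_delta, 4.0 * alpha ** 2 * n, kl_p1_p2)

# Hypothetical inputs: Phi(delta) = 1, n = 100, D_kl(P1 || P2) = 0.001.
print(le_cam_classical(1.0, 100, 0.001))
print(le_cam_private(1.0, 100, 0.1, 0.001))
```

With $\alpha = 0.1$ the effective sample size drops from 100 to 4, so the private lower bound on the risk is larger than the classical one, reflecting a harder problem.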



Consequences for local Fano's method: We now turn to consequences of the so-called local
form of Fano's method. It is based on constructing a family of distributions $\{P_\nu, \nu \in \mathcal{V}\}$ that
defines a $2\delta$-packing, meaning that $\rho(\theta(P_\nu), \theta(P_{\nu'})) \ge 2\delta$ for all $\nu \neq \nu'$, additionally satisfying

$$D_{\mathrm{kl}}(P_\nu \| P_{\nu'}) \le \kappa^2 \delta^2 \tag{18}$$

for some fixed $\kappa > 0$. Recalling Fano's inequality (9), we note that by a convexity argument, the
pairwise upper bounds (18) imply $I(X_1, \ldots, X_n; V) \le n\kappa^2\delta^2$. We thus obtain the local Fano lower
bound [37] on the classical minimax risk, namely

$$\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho) \ge \Phi(\delta) \left\{ 1 - \frac{n\kappa^2\delta^2 + \log 2}{\log |\mathcal{V}|} \right\}. \tag{19}$$

Returning to the $\alpha$-locally private setting, suppose that we wish to lower bound the $\alpha$-minimax
risk (5). By Pinsker's inequality, the pairwise bound (18) implies that $\|P_\nu - P_{\nu'}\|_{\mathrm{TV}}^2 \le \frac{1}{2}\kappa^2\delta^2$ for
all $\nu \neq \nu'$. Combining this inequality with the upper bound (15) from Corollary 1, we find that

$$I(Z_1, \ldots, Z_n; V) \le 2n(e^{\alpha} - 1)^2 \kappa^2 \delta^2 \le 4n\alpha^2\kappa^2\delta^2 \quad \text{for } \alpha \in [0, 23/35].$$

Consequently, by the Fano inequality, we obtain the $\alpha$-private version of the local Fano bound:

$$\mathfrak{M}_n(\Theta, \Phi \circ \rho, \alpha) \ge \Phi(\delta) \left\{ 1 - \frac{4n\alpha^2\kappa^2\delta^2 + \log 2}{\log |\mathcal{V}|} \right\}. \tag{20}$$

Once again, by comparison to the classical version (19), we see that, for all $\alpha \in [0, \frac{23}{35}]$, the price for
privacy is a reduction in the effective sample size from $n$ to $4\alpha^2 n$.
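
In applications of a bound like (20), one typically picks the packing radius $\delta$ so that the bracketed term stays bounded away from zero. The sketch below evaluates (20) and solves for the $\delta^2$ at which the bracket equals $1/2$. The "one half" target, the function names, and the numeric inputs are assumptions chosen for illustration; they are not prescribed by the text.

```python
import math

def local_fano_private(phi_delta, n, alpha, kappa, delta, num_hypotheses):
    """Right-hand side of the alpha-private local Fano bound (20)."""
    info = 4.0 * n * alpha ** 2 * kappa ** 2 * delta ** 2
    return phi_delta * (1.0 - (info + math.log(2.0)) / math.log(num_hypotheses))

def delta_sq_for_half(n, alpha, kappa, num_hypotheses):
    """delta^2 at which the bracket in (20) equals 1/2.

    Solves 4 n alpha^2 kappa^2 delta^2 + log 2 = (1/2) log|V|, which needs |V| > 4.
    """
    slack = 0.5 * math.log(num_hypotheses) - math.log(2.0)
    if slack <= 0.0:
        raise ValueError("need more than 4 hypotheses")
    return slack / (4.0 * n * alpha ** 2 * kappa ** 2)

# Hypothetical inputs: n = 1000, alpha = 0.5, kappa = 1, |V| = 16.
d2 = delta_sq_for_half(1000, 0.5, 1.0, 16)
# With Phi(t) = t^2 the resulting lower bound is delta^2 / 2.
print(d2, local_fano_private(d2, 1000, 0.5, 1.0, math.sqrt(d2), 16))
```

Because the solution for $\delta^2$ scales as $1/(n\alpha^2)$, the resulting lower bound makes the effective sample size reduction visible directly.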

3.3 Some applications of Theorem 1

In this section, we illustrate the use of the $\alpha$-private versions of Le Cam's and Fano's inequalities.
First, we study the problem of mean estimation in location families; in addition to demonstrating
how the minimax rate changes as a function of $\alpha$, we also reveal some interesting (and perhaps
disturbing) effects of enforcing $\alpha$-local differential privacy. Our second example studies fixed design
linear regression, where we again see the reduction in effective sample size from $n$ to $\alpha^2 n$.

3.3.1 Location family models

Let us begin with mean estimation in location families. In particular, for some $k > 1$, consider the
family

$$\mathcal{P}_k := \left\{ \text{distributions } P \text{ such that } E_P[X] \in [-1, 1] \text{ and } E_P[|X|^k] \le 1 \right\},$$

and suppose that our goal is to estimate the mean $\theta(P) = E_P[X]$. In this section, we characterize
the $\alpha$-private minimax risk in squared Euclidean distance,

$$\mathfrak{M}_n(\theta(\mathcal{P}_k), (\cdot)^2, \alpha) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\widehat{\theta}} \sup_{P \in \mathcal{P}_k} E\left[ \left( \widehat{\theta}(Z_1, \ldots, Z_n) - \theta(P) \right)^2 \right]. \tag{21}$$

Proposition 1. There exist universal constants $0 < c_\ell \le c_u < \infty$ such that for all $k > 1$, the
minimax error (21) is bounded as

$$c_\ell \min\left\{ 1, (n\alpha^2)^{-\frac{k-1}{k}} \right\} \le \mathfrak{M}_n(\theta(\mathcal{P}_k), (\cdot)^2, \alpha) \le c_u \min\left\{ 1, \max\left\{ 1, (k-1)^{-2} \right\} (n\alpha^2)^{-\frac{k-1}{k}} \right\}. \tag{22}$$

We prove this result using the $\alpha$-private version (17) of Le Cam's inequality; see Section 6.3 for the
details.

In order to understand Proposition 1, it is worthwhile considering some special cases, beginning
with the usual setting of random variables with finite variance ($k = 2$). In the non-private setting
(where the original samples $(X_1, \ldots, X_n)$ are directly observed), the sample mean $\widehat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i$
has mean-squared error at most $1/n$. However, when we require $\alpha$-local differential privacy, then
Proposition 1 shows that the minimax rate worsens to $1/\sqrt{n\alpha^2}$. More generally, for any $k > 1$,
the minimax rate scales as $\mathfrak{M}_n(\theta(\mathcal{P}_k), (\cdot)^2, \alpha) \asymp (n\alpha^2)^{-\frac{k-1}{k}}$, ignoring $k$-dependent pre-factors. As
$k \uparrow \infty$, the moment condition $E[|X|^k] \le 1$ becomes equivalent to the boundedness constraint
$|X| \le 1$ a.s., and we obtain the more standard parametric rate $(n\alpha^2)^{-1}$. Here there is no reduction
in the exponent, but rather only the reduction in effective sample size from $n$ to $\alpha^2 n$.
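
For the bounded case $|X| \le 1$, the parametric rate $(n\alpha^2)^{-1}$ can be illustrated with a simple mechanism: each provider adds Laplace noise to their value and the statistician averages the noisy reports. The sketch below is an assumption-laden illustration (noise scale $2/\alpha$, which suffices for $\alpha$-local privacy when values lie in $[-1, 1]$; hypothetical function names; synthetic data), not the estimator used in the paper's proof.

```python
import math
import random

def laplace_noise(rng, scale):
    """Sample Laplace(0, scale) noise as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize(x, alpha, rng):
    """Release x in [-1, 1] with Laplace noise of scale 2/alpha.

    Two inputs differ by at most 2, so the density ratio of the outputs
    is at most exp(alpha), matching the non-interactive definition (2).
    """
    return x + laplace_noise(rng, 2.0 / alpha)

def private_mean(xs, alpha, rng):
    """Average of the privatized views Z_i."""
    return sum(privatize(x, alpha, rng) for x in xs) / len(xs)

def private_mean_mse(var_x, n, alpha):
    """Mean-squared error of private_mean: (Var(X) + 8 / alpha^2) / n."""
    return (var_x + 8.0 / alpha ** 2) / n

rng = random.Random(0)
n, alpha = 5000, 1.0
xs = [rng.uniform(-0.25, 0.75) for _ in range(n)]  # true mean 0.25, variance 1/12
estimate = private_mean(xs, alpha, rng)
print(estimate, math.sqrt(private_mean_mse(1.0 / 12.0, n, alpha)))
```

The error formula shows the $1/(n\alpha^2)$ scaling for small $\alpha$: the noise term $8/(n\alpha^2)$ dominates the sampling term $\mathrm{Var}(X)/n$.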

More generally, the behavior of the $\alpha$-private minimax rates (22) helps demarcate between
situations in which local differential privacy may or may not be acceptable. In particular, for
bounded domains, where we may take $k \uparrow \infty$, local differential privacy may be quite acceptable.
However, in situations in which the samples take values in unbounded spaces, differential
privacy provides much stricter constraints, forcing estimators to suffer substantially. Intuitively,
the constraint that for $x \to \infty$ and $x' \to -\infty$ we must have $Q(S \mid X = x)/Q(S \mid X = x') \in [e^{-\alpha}, e^{\alpha}]$
for any measurable set $S$ is quite strong. Indeed, in Appendix A we discuss an example that
illustrates the pathological consequences of providing (local) differential privacy for non-compact
spaces.



3.3.2 Linear regression with fixed design 

We turn now to the linear regression problem. To make this case concrete, we assume we have
a known design matrix $X \in \mathbb{R}^{n \times d}$ and the observation model

$$Y = X\theta^* + \varepsilon, \tag{23}$$

where $\varepsilon \in \mathbb{R}^n$ is a sequence of independent, zero-mean noise variables. For simplicity, we assume
that we seek to estimate $\theta^* \in \Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le 1\}$, the $\ell_2$-ball of radius 1, and that there exists
a scaling constant $\sigma < \infty$ such that the noise sequence satisfies $|\varepsilon_i| \le \sigma$ for all $i$. Given the challenges of
non-compactness exhibited by the location family estimation problems in Proposition 1, this assumption
is hopefully not too obtrusive. We further assume that $X^\top X$ is invertible, so we require that $n \ge d$.

With the model (23) in place, let us consider estimation of $\theta^*$ in the squared $\ell_2$-norm. That is,
we wish to give upper and lower bounds on the estimation error of $\widehat{\theta}$ for $\theta^*$, based on differentially
private views of the dependent variables, using the expectation $E[\|\widehat{\theta} - \theta^*\|_2^2]$. By following
the outline established in Section 3.2, we can prove the following result.



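Proposition 2. Consider estimation in the fixed design regression model (23), where the variables
$Y_i$ and $\varepsilon_i$ are $\alpha$-locally differentially private with $\alpha = O(1)$. There exist universal constants
$0 < c_\ell \le c_u < \infty$ such that

$$c_\ell \min\left\{ 1, \frac{\sigma^2 d^2}{\mathrm{tr}\left(\frac{1}{n} X^\top X\right) n\alpha^2} \right\} \le \mathfrak{M}_n\left(\Theta, \|\cdot\|_2^2, \alpha\right) \le c_u \min\left\{ 1, \frac{\sigma^2 \, \mathrm{tr}\left(\left(\frac{1}{n} X^\top X\right)^{-1}\right)}{n\alpha^2} \right\}.$$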

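We provide the proof of Proposition 2 in Section 6.4, but some remarks may make it clearer.
Let $\rho_i(A)$ denote the $i$th singular value of a matrix $A$, with $\rho_{\max}$ and $\rho_{\min}$ the largest and
smallest singular values. By noting the matrix inequalities

$$\frac{1}{n} X^\top X \preceq \rho_{\max}^2(X/\sqrt{n}) \, I_{d \times d} \quad \text{and} \quad \frac{1}{n} X^\top X \succeq \rho_{\min}^2(X/\sqrt{n}) \, I_{d \times d},$$

we have the inequalities

$$\mathrm{tr}\left(\tfrac{1}{n} X^\top X\right) \le d \, \rho_{\max}^2(X/\sqrt{n}) \quad \text{and} \quad \mathrm{tr}\left(\left(\tfrac{1}{n} X^\top X\right)^{-1}\right) \le \frac{d}{\rho_{\min}^2(X/\sqrt{n})}.$$

As a consequence, we see that under the conditions of Proposition 2 there exist universal constants
$0 < c_\ell \le c_u < \infty$ such that

$$c_\ell \min\left\{ 1, \frac{\sigma^2 d}{n\alpha^2 \rho_{\max}^2(X/\sqrt{n})} \right\} \le \mathfrak{M}_n\left(\Theta, \|\cdot\|_2^2, \alpha\right) \le c_u \min\left\{ 1, \frac{\sigma^2 d}{n\alpha^2 \rho_{\min}^2(X/\sqrt{n})} \right\}.$$

If the fixed design matrix $X$ satisfies the orthonormality condition $\frac{1}{n} X^\top X = I_{d \times d}$, then this gives
the minimax rate $\mathfrak{M}_n(\Theta, \|\cdot\|_2^2, \alpha) \asymp \sigma^2 d / (n\alpha^2)$. Comparing to standard minimax rates for linear
regression problems, which scale as $\sigma^2 d / n$, we see that requiring differential privacy indeed causes
an effective sample size reduction from $n$ to $n\alpha^2$.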

Up to differences in scaling in the maximum and minimum singular values of the design X, we 
have completely determined the minimax rate for fixed-design linear regression under a differential 
privacy constraint. Moreover, as the proof makes clear, the upper bounds are attained by adding 
Laplacian noise to the dependent variables Yi, then solving the resulting normal equations as in 
standard linear regression. 
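
The procedure described above (perturb each response with Laplace noise, then solve the normal equations) can be sketched in a few lines. The noise calibration below, with scale $2B/\alpha$ for an assumed bound $|Y_i| \le B$, as well as the helper names and the synthetic design, are assumptions made for illustration; the paper's proof may calibrate the noise differently.

```python
import math
import random

def laplace_noise(rng, scale):
    """Sample Laplace(0, scale) noise as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    d = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, d):
            f = m[r][col] / m[col][col]
            for c in range(col, d + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):
        x[i] = (m[i][d] - sum(m[i][j] * x[j] for j in range(i + 1, d))) / m[i][i]
    return x

def private_least_squares(X, y, alpha, y_bound, rng):
    """Add Laplace noise of scale 2 * y_bound / alpha to each response
    (assumes |y_i| <= y_bound), then solve X^T X theta = X^T z."""
    z = [yi + laplace_noise(rng, 2.0 * y_bound / alpha) for yi in y]
    d = len(X[0])
    gram = [[sum(row[j] * row[k] for row in X) for k in range(d)] for j in range(d)]
    rhs = [sum(row[j] * zi for row, zi in zip(X, z)) for j in range(d)]
    return solve(gram, rhs)

# Synthetic example: d = 2, entries of X in {-1, +1}, bounded noise with sigma = 0.1.
rng = random.Random(1)
n, sigma, theta_star = 4000, 0.1, [0.6, -0.3]
X = [[rng.choice((-1.0, 1.0)), rng.choice((-1.0, 1.0))] for _ in range(n)]
y = [r[0] * theta_star[0] + r[1] * theta_star[1] + rng.uniform(-sigma, sigma) for r in X]
y_bound = math.sqrt(2.0) + sigma  # |x_i^T theta| <= ||x_i||_2 ||theta||_2 <= sqrt(2)
theta_hat = private_least_squares(X, y, 1.0, y_bound, rng)
print(theta_hat)
```

Only the responses are perturbed, since the design matrix is fixed and known; the estimator remains unbiased because the added noise has mean zero.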



4 Variational bounds on mutual information under local privacy 

In this section, we turn to a more general and powerful upper bound on the mutual information. As
we have previously noted, Theorem 1 can be used to obtain indirect upper bounds on the mutual
information, but the resulting bounds all involve pairwise distances only, as in Corollary 1, so that
these bounds must be used with local packings. Exploiting Fano's inequality in its full generality
requires a more sophisticated upper bound on the mutual information under local privacy, which
is the main topic of this section. In Section 5 to follow, we show how this upper bound can be used
to derive sharp minimax rates for the problem of convex risk minimization under local privacy.

We begin with some definitions needed to state the result. Let $V$ be a discrete random variable uniformly distributed over some finite set $\mathcal{V}$. Given a family of distributions $\{P_\nu, \nu \in \mathcal{V}\}$, we define the mixture distribution

$$\overline{P} := \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} P_\nu.$$

If $V$ is sampled uniformly from $\mathcal{V}$, and conditional on $V = \nu$ the random variable $X$ has distribution $P_\nu$ (meaning that marginally $X \sim \overline{P}$), then by definition of mutual information

$$I(X; V) = \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} D_{\mathrm{kl}}(P_\nu \| \overline{P}),$$

a representation that plays an important role in our theory. As in the definition (3), any conditional distribution $Q$ also induces the marginal family $\{M_\nu, \nu \in \mathcal{V}\}$, as well as the associated mixture distribution $\overline{M} := \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} M_\nu$. Our goal is to upper bound quantities related to the mutual information $I(Z_1, \ldots, Z_n; V)$, where the random variables $Z_i$ are drawn according to $M_\nu$.

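Our upper bound is variational in nature, meaning that it involves optimization over a set of functions $\mathcal{G}_\alpha \subset L^\infty(\mathcal{X})$, where we recall that $L^\infty(\mathcal{X}) := \{f : \mathcal{X} \to \mathbb{R} \mid \sup_{x \in \mathcal{X}} |f(x)| < \infty\}$. In particular, for a given $\alpha \ge 0$, we define

$$\mathcal{G}_\alpha(\mathcal{X}) := \big\{\gamma \in L^\infty(\mathcal{X}) : \gamma(x) \in [e^{-\alpha} - e^{\alpha},\, e^{\alpha} - e^{-\alpha}]/2 \ \text{for all } x \in \mathcal{X}\big\}. \tag{24}$$

This set describes the maximal amount of perturbation allowed in the conditional $Q$ for any fixed $x \in \mathcal{X}$. Since the set $\mathcal{X}$ is generally clear from context, we typically omit this dependence. Finally, for each $\nu \in \mathcal{V}$, we define the linear functional $\varphi_\nu : L^\infty(\mathcal{X}) \to \mathbb{R}$ by

$$\varphi_\nu(\gamma) = \int_{\mathcal{X}} \gamma(x) \big(dP_\nu(x) - d\overline{P}(x)\big).$$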

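With these definitions, we have the following result:

Theorem 2. For a given $\alpha \in [0, \log(\frac{1}{2} + \frac{1}{2}\sqrt{3}))$, let $Q$ be $\alpha$-locally private (1) for samples $X \in \mathcal{X}$. For any collection $\{P_\nu, \nu \in \mathcal{V}\}$ of probability measures on $\mathcal{X}$, we have

$$\frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \big[D_{\mathrm{kl}}(M_\nu \| \overline{M}) + D_{\mathrm{kl}}(\overline{M} \| M_\nu)\big] \le \frac{C_\alpha}{|\mathcal{V}|} \sup_{\gamma \in \mathcal{G}_\alpha} \sum_{\nu \in \mathcal{V}} \big(\varphi_\nu(\gamma)\big)^2,$$

where $C_\alpha := 4/(e^{-\alpha} - 2(e^{\alpha} - 1))$.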

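Remark: We require the upper bound $\alpha < \log(\frac{1}{2} + \frac{1}{2}\sqrt{3}) \approx 0.31$ to ensure that $C_\alpha$ is finite. An inspection of the proof of Theorem 2 shows that if $\|P_\nu - \overline{P}\|_{\mathrm{TV}} \le t$ for all $\nu \in \mathcal{V}$, we may take $C_\alpha = 4/(e^{-\alpha} - 2t(e^{\alpha} - 1))$ and similarly allow $\alpha < \log(\frac{1}{2} + \frac{1}{2}\sqrt{1 + 2/t}) \approx \log(\frac{1}{2} + 1/\sqrt{2t})$.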
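
As a quick numerical illustration of the remark, the sketch below evaluates the constant and the admissible range of $\alpha$. The function names `c_alpha` and `alpha_limit` are hypothetical, and the closed form for the limit is obtained by solving $2tu^2 - 2tu - 1 < 0$ with $u = e^{\alpha}$.

```python
import math


def c_alpha(alpha, t=1.0):
    # C_alpha = 4 / (exp(-alpha) - 2 t (exp(alpha) - 1)); t = 1 is the general case.
    denom = math.exp(-alpha) - 2.0 * t * (math.exp(alpha) - 1.0)
    if denom <= 0.0:
        raise ValueError("alpha is too large for C_alpha to be finite")
    return 4.0 / denom


def alpha_limit(t=1.0):
    # Supremum of the alpha values for which the denominator stays positive.
    return math.log(0.5 + math.sqrt(t * t + 2.0 * t) / (2.0 * t))
```

With $t = 1$ the limit evaluates to roughly $0.31$, and `c_alpha(0.25)` stays just below 19, in line with the constant used in Corollary 2.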

Up to constant factors, Theorem 2 is never weaker than the results provided by Theorem 1, in particular, the bounds on the mutual information from Corollary 1. Let us see how a weakened form of Theorem 2 yields that type of bound:

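Corollary 2. Under the conditions of Theorem 2, there is a universal constant $c \le 19$ such that

$$I(Z_1, \ldots, Z_n; V) \le c\,(e^{\alpha} - 1)^2 \sum_{i=1}^n \frac{1}{|\mathcal{V}|^2} \sum_{\nu, \nu' \in \mathcal{V}} \|P_{\nu,i} - P_{\nu',i}\|_{\mathrm{TV}}^2 \quad \text{for } \alpha \in [0, 1/4].$$

Proof. We begin with an immediate weakening of the variational bound in Theorem 2, namely

$$\sup_{\gamma \in \mathcal{G}_\alpha} \sum_{\nu \in \mathcal{V}} \big(\varphi_\nu(\gamma)\big)^2 \le \sum_{\nu \in \mathcal{V}} \sup_{\gamma \in \mathcal{G}_\alpha} \big(\varphi_\nu(\gamma)\big)^2. \tag{25}$$

The inner supremum is attained by setting $\gamma(x) = (e^{\alpha} - e^{-\alpha})/2$ for $x$ such that (abusing notation somewhat) $dP_\nu(x) > d\overline{P}(x)$, while $\gamma(x) = (e^{-\alpha} - e^{\alpha})/2$ otherwise. By inspection, this yields

$$\sup_{\gamma \in \mathcal{G}_\alpha} \big(\varphi_\nu(\gamma)\big)^2 = \Big(\frac{e^{\alpha} - e^{-\alpha}}{2} \|P_\nu - \overline{P}\|_{\mathrm{TV}}\Big)^2 = \frac{1}{4}(e^{\alpha} - e^{-\alpha})^2 \|P_\nu - \overline{P}\|_{\mathrm{TV}}^2 \le (e^{\alpha} - 1)^2 \|P_\nu - \overline{P}\|_{\mathrm{TV}}^2,$$

since $e^{\alpha} - e^{-\alpha} \le 2(e^{\alpha} - 1)$. Since $C_\alpha \le 19$ for $\alpha \in [0, 1/4]$, we have consequently shown that

$$\frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \big[D_{\mathrm{kl}}(M_\nu \| \overline{M}) + D_{\mathrm{kl}}(\overline{M} \| M_\nu)\big] \le C_\alpha (e^{\alpha} - 1)^2 \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \|P_\nu - \overline{P}\|_{\mathrm{TV}}^2 \le 19 (e^{\alpha} - 1)^2 \frac{1}{|\mathcal{V}|^2} \sum_{\nu, \nu' \in \mathcal{V}} \|P_\nu - P_{\nu'}\|_{\mathrm{TV}}^2. \qquad \square$$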



The strength of Theorem 2 arises from the fact that the inequality (25), where we interchange the order of the supremum and summation, may be quite loose.

Now we present two corollaries that extend Theorem 2. First, we have a bound using all pairs of the packing members.

Corollary 3. Under the conditions of Theorem 2, we have

$$\frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \big[D_{\mathrm{kl}}(M_\nu \| \overline{M}) + D_{\mathrm{kl}}(\overline{M} \| M_\nu)\big] \le \frac{C_\alpha}{|\mathcal{V}|^2} \sup_{\gamma \in \mathcal{G}_\alpha} \sum_{\nu, \nu' \in \mathcal{V}} \big(\varphi_\nu(\gamma) - \varphi_{\nu'}(\gamma)\big)^2.$$

This claim follows immediately from convexity, since $(\varphi_\nu(\gamma))^2 \le \frac{1}{|\mathcal{V}|} \sum_{\nu' \in \mathcal{V}} (\varphi_\nu(\gamma) - \varphi_{\nu'}(\gamma))^2$.

We can also provide a result analogous to Corollary 1, which allows us to apply the minimax lower bounds outlined in Section 2.1. This corollary shows concretely that, so long as the released data $Z_i$ is $\alpha$-differentially private for the original samples $X_i$, we may bound the information available to any statistical procedure using the geometry of the packing set $\mathcal{V}$.

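Corollary 4. Let $V$ be distributed uniformly at random in $\mathcal{V}$, and assume that given $V = \nu$, the samples $X_i$ are sampled independently according to the distributions $P_{\nu,i}$ for $i = 1, \ldots, n$. Define $\overline{P}_i = \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} P_{\nu,i}$ and the linear functionals $\varphi_{\nu,i} : L^\infty(\mathcal{X}) \to \mathbb{R}$ by

$$\varphi_{\nu,i}(\gamma) := \int_{\mathcal{X}} \gamma(x) \big(dP_{\nu,i}(x) - d\overline{P}_i(x)\big).$$

If for each $i$, $Z_i$ is $\alpha$-differentially private for $X_i$, then in the notation of Theorem 2,

$$I(Z_1, \ldots, Z_n; V) \le \sum_{i=1}^n \frac{C_\alpha}{|\mathcal{V}|} \sup_{\gamma \in \mathcal{G}_\alpha} \sum_{\nu \in \mathcal{V}} \big(\varphi_{\nu,i}(\gamma)\big)^2.$$

We provide the proof of Corollary 4 in Section 7.2; the proof follows similar arguments to those used to prove Corollary 1.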

Theorem 2 and Corollaries 3 and 4 relate the amount of mutual information between the random perturbed views $Z$ of the data to variational properties of the underlying packing $\mathcal{V}$ of the parameter space $\Theta$. In particular, Theorem 2 and Corollary 3 show that if we can find a packing set $\mathcal{V}$ that yields linear functionals whose sum has good "spectral" properties (meaning a small operator norm when taking suprema over $L^\infty$-type spaces), then we can provide sharper results.



5 Convex risk minimization under local privacy 

The notion of minimizing a risk functional lies at the heart of decision-theoretic statistics, dating back to the seminal work of Wald [34]. In practice, it is most attractive to minimize convex functions, and thus convex risk minimization provides a natural setting in which to illustrate the power of Theorem 2. In earlier work, we studied the problem of privacy preservation under convex risk minimization via a computation of saddle points of the mutual information [11]. The results presented here are more general, and the proofs are more direct, since Theorem 2 allows us to circumvent the saddle point characterization that played a central role in the earlier paper.



5.1 Problem formulation 

Given a compact convex set $\Theta \subset \mathbb{R}^d$, our goal is to find a parameter value $\theta \in \Theta$ achieving good average performance under a loss function $\ell : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}_+$. Here the value $\ell(x, \theta)$ measures the performance of the parameter vector $\theta \in \Theta$ on the sample $x \in \mathcal{X}$, and $\ell(x, \cdot) : \mathbb{R}^d \to \mathbb{R}_+$ is convex for $x \in \mathcal{X}$. We measure the expected performance of $\theta \in \Theta$ via the risk function

$$R(\theta) := \mathbb{E}_P[\ell(X, \theta)], \tag{26}$$

where the expectation is taken over some unknown distribution $P$ over the space $\mathcal{X}$. With $\widehat{\theta}_n$ denoting an estimator based on the perturbed samples $Z_i$, we explicitly quantify the rate of convergence of $R(\widehat{\theta}_n)$ to $\inf_{\theta \in \Theta} R(\theta)$ as a function of the number of samples $n$ and the amount of privacy preserved by releasing the privatized data $\{Z_i\}_{i=1}^n$ as opposed to the initial samples $\{X_i\}_{i=1}^n$.

In order to state our results, we require some definitions related to function classes and risks. 

Definition 1 ((Uniform) Lipschitz continuity). For a given $x \in \mathcal{X}$, the function $\theta \mapsto \ell(x, \theta)$ is $L$-Lipschitz continuous with respect to the $\ell_p$-norm if

$$|\ell(x, \theta) - \ell(x, \theta')| \le L \|\theta - \theta'\|_p \quad \text{for } \theta, \theta' \in \Theta. \tag{27}$$

The loss function $\ell$ is $\mathcal{X}$-uniformly $(L, p)$-Lipschitz continuous if inequality (27) holds for all $x \in \mathcal{X}$.

For future reference, we note that the Lipschitz condition (27) is equivalent [26] to imposing a boundedness condition on the subdifferential in the $\ell_q$-norm, where $1/p + 1/q = 1$: for any vector $g \in \mathbb{R}^d$ in the subdifferential $\partial_\theta \ell(x, \theta)$, we have $\|g\|_q \le L$. We use $\|\partial_\theta \ell(x, \theta)\|_q \le L$ as shorthand for this condition. Consequently, the loss function $\ell$ is $\mathcal{X}$-uniformly $L$-Lipschitz continuous with respect to the $\ell_p$-norm if and only if

$$\sup_{\theta \in \Theta} \sup_{x \in \mathcal{X}} \|\partial_\theta \ell(x, \theta)\|_q \le L. \tag{28}$$

We now turn to the minimax error that we study in the context of convex risk minimization. Let $\mathcal{M}$ denote any statistical procedure or method for minimizing $R$, and let $\widehat{\theta}_n$ denote the output of $\mathcal{M}$ after receiving the $n$ private samples $Z_1, \ldots, Z_n$. The excess risk of the method $\mathcal{M}$ for $R$ is

$$\epsilon_n(\mathcal{M}, \ell, \Theta, P) := R(\widehat{\theta}_n) - \inf_{\theta \in \Theta} R(\theta) = \mathbb{E}_P[\ell(X, \widehat{\theta}_n)] - \inf_{\theta \in \Theta} \mathbb{E}_P[\ell(X, \theta)]. \tag{29}$$

The excess risk (29) is a random variable, since the output $\widehat{\theta}_n$ of the method is random: it depends on both the random variables $X_i$ and their (randomly) masked versions constructed via the channel distributions $Q$. We thus take the expectation and measure the expected sub-optimality of the risk according to $P$ and $Q$. We let $\mathcal{L}$ denote a collection of loss functions, where for a distribution $P$ on $\mathcal{X}$, the set $\mathcal{L}(P)$ denotes the losses $\ell : \operatorname{supp} P \times \Theta \to \mathbb{R}_+$ belonging to $\mathcal{L}$. The minimax error is then given by

$$\epsilon_n^*(\mathcal{L}, \Theta, \alpha) := \inf_{\mathcal{M}, Q} \sup_{P,\, \ell \in \mathcal{L}(P)} \mathbb{E}_{P,Q}[\epsilon_n(\mathcal{M}, \ell, \Theta, P)], \tag{30}$$

where the expectation is taken over the random samples $X \sim P$ and $Z \sim Q(\cdot \mid X)$ and the infimum is taken over all inference methods and $\alpha$-locally differentially private (1) distributions $Q$.

5.2 Minimax lower bounds for private convex optimization 

We now characterize the minimax rates for convex risk minimization problems under $\alpha$-local privacy. Each of our propositions considers minimization of convex, Lipschitz-continuous loss functions over a domain $\Theta \subset \mathbb{R}^d$.

Our first lower bound applies to a class of functions Lipschitz with respect to the $\ell_1$-norm, where the optimization takes place over the ball $\mathbb{B}_1(r) := \{\theta \in \mathbb{R}^d \mid \|\theta\|_1 \le r\}$. We define the set

$$\mathcal{L}(\mathbb{B}_1(r); L) := \{\ell : \mathcal{X} \times \mathbb{B}_1(r) \to \mathbb{R} \mid \ell \text{ is convex, } \mathcal{X}\text{-uniformly } (L, 1)\text{-continuous}\}. \tag{31}$$

As a specific example, this loss class covers the problem of the multi-dimensional median, where

$$\ell(x, \theta) = \|x - \theta\|_1,$$

as well as losses used to construct linear classifiers, such as the hinge loss $\ell(x, \theta) = [1 - \langle x, \theta \rangle]_+$.
For this class, we have the following minimax rate:

Proposition 3. For the loss class $\mathcal{L}(\mathbb{B}_1(r); L)$ and privacy parameter $\alpha \in [0, \frac{1}{4}]$, there are universal constants $0 < c_\ell \le c_u < \infty$ such that

$$c_\ell \min\Big\{\frac{rL\sqrt{d \log(2d)}}{\sqrt{n\alpha^2}},\, rL\Big\} \le \epsilon_n^*(\mathcal{L}, \mathbb{B}_1(r), \alpha) \le c_u \min\Big\{\frac{rL\sqrt{d \log(2d)}}{\sqrt{n\alpha^2}},\, rL\Big\}. \tag{32}$$

Proposition 3 provides a sharp characterization of the minimax rate up to the constant factors $(c_\ell, c_u)$. It is worth noting that the non-private minimax rate for the class $\mathcal{L}(\mathbb{B}_1(r); L)$ is given by

$$\frac{rL\sqrt{\log(2d)}}{\sqrt{n}}$$

(see Duchi et al. [11, Theorem 1]). By comparison to the inequalities (32), we see that $\alpha$-local differential privacy has a dimension-dependent effect on the minimax rate: the effective sample size is reduced not simply from $n$ to $\alpha^2 n$, as in Section 3, but rather from $n$ to $\alpha^2 n/d$. In effect, requiring $\alpha$-differential privacy is a stringent constraint in high dimensions: since all dimensions must be uniformly protected, the convergence rate suffers a significant penalty.

We can also give a result for a larger class of domains and related optimization functions. Indeed, consider the loss class

$$\mathcal{L}(\Theta; L, p, r) := \{\ell : \mathcal{X} \times \Theta \to \mathbb{R} \mid \ell \text{ is convex, } \mathcal{X}\text{-uniformly } (L, p)\text{-continuous}\}, \tag{33}$$

for some $p \in [2, \infty]$, and some set $\Theta$ that contains the ball $\mathbb{B}_\infty(r) = \{\theta \in \mathbb{R}^d \mid \|\theta\|_\infty \le r\}$.

Proposition 4. For the loss class $\mathcal{L}(\Theta; L, p, r)$ from equation (33) and privacy parameter $\alpha \in [0, \frac{1}{4}]$, there exists a universal numerical constant $0 < c_\ell$ such that

$$c_\ell \min\Big\{\frac{1}{\alpha} \frac{rLd}{\sqrt{n}},\, rL\Big\} \le \epsilon_n^*(\mathcal{L}, \Theta, \alpha). \tag{34a}$$

If in addition $\Theta \subset \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le C r \sqrt{d}\}$ for some (absolute) constant $C$, there exists a universal numerical constant $c_u \in [c_\ell, \infty)$ such that

$$\epsilon_n^*(\mathcal{L}, \Theta, \alpha) \le c_u \min\Big\{\frac{1}{\alpha} \frac{rLd}{\sqrt{n}},\, rL\Big\}. \tag{34b}$$

As with Proposition 3, the inequalities (34) provide a characterization of the $\alpha$-private minimax rate that is tight up to constant factors. Again, it is worthwhile to relate this minimax rate to the non-private setting: from Theorem 1 of Agarwal et al. [1], the non-private minimax rate for the function class $\mathcal{L}(\Theta; L, p, r)$ is lower bounded by $rL\sqrt{d}/\sqrt{n}$; noting that if $\Theta \subset c[-r, r]^d$ for some constant $c$ then $\Theta \subset c\sqrt{d}\,\mathbb{B}_2(r)$ shows that the bound (34b) is sharp. Consequently, the price for $\alpha$-privacy is again a reduction in effective sample size by the dimension-dependent factor $\alpha^2/d$.

Proposition 4 has an interesting corollary in application to convex risk minimization problems over the $\ell_q$-norm balls of the form

$$\mathbb{B}_q(r_q) := \{\theta \in \mathbb{R}^d : \|\theta\|_q \le r_q\}, \quad \text{where } q \in [2, \infty].$$

For any such ball, Proposition 4 may be applied with $r = d^{-1/q} r_q$, since with this choice of $r$, we have $\mathbb{B}_\infty(r) \subset \mathbb{B}_q(r_q)$. Putting together the pieces, let us define the function class

$$\mathcal{L}(\mathbb{B}_q(r_q); L, p') := \{\ell : \mathcal{X} \times \mathbb{B}_q(r_q) \to \mathbb{R} \mid \ell \text{ is convex, } \mathcal{X}\text{-uniformly } (L, p')\text{-continuous}\} \tag{35}$$

for some $p' \in [2, \infty]$. We then have:

Corollary 5 (Minimax rates over $\ell_q$-balls). For the class $\mathcal{L}(\mathbb{B}_q(r_q); L, p')$ from equation (35) with $q \in [2, \infty]$, there exist universal (numerical) constants $0 < c_\ell \le c_u < \infty$ such that

$$c_\ell \frac{\sqrt{d}\, r_q L d^{\frac{1}{2} - \frac{1}{q}}}{\sqrt{n\alpha^2}} \le \epsilon_n^*(\mathcal{L}, \mathbb{B}_q(r_q), \alpha) \le c_u \frac{\sqrt{d}\, r_q L d^{\frac{1}{2} - \frac{1}{q}}}{\sqrt{n\alpha^2}}. \tag{36}$$

From past work (see equation (11) in the paper [1]), the non-private minimax risk for the function class (35) scales as $L r_q d^{\frac{1}{2} - \frac{1}{q}}/\sqrt{n}$. Once again, we see that the effect of imposing $\alpha$-local differential privacy is to reduce the effective sample size from $n$ to $n\alpha^2/d$.



5.3 Matching upper bounds by stochastic mirror descent 

We provide the proofs of the lower bounds in Propositions 3 and 4 in Sections 8.2 and 8.3 respectively. They are based on a combination of Theorem 2 with Fano's method. In this section, we
describe how the matching upper bounds can be achieved using simple and practical algorithms — 
namely, stochastic gradient descent and their non-Euclidean generalizations [2^, 0, [s^l — along with 
the "right" type of stochastic perturbation to guarantee a-local differential privacy. We note that 
these algorithms require interactive privacy mechanisms, as they iteratively process the data. 

We first give a brief review of (stochastic) mirror descent algorithms. Given a differentiable convex function $\psi : \mathbb{R}^d \to \mathbb{R}$, we may define the Bregman divergence associated with $\psi$ via

$$D_\psi(u, v) := \psi(u) - \psi(v) - \langle \nabla \psi(v), u - v \rangle \ge 0.$$

For instance, the function $\psi(u) = \frac{1}{2}\|u\|_2^2$ generates the usual (squared) Euclidean distance $D_\psi(u, v) = \frac{1}{2}\|u - v\|_2^2$. Other choices of Bregman divergences are useful for problems with non-Euclidean geometries (e.g., the Kullback-Leibler divergence for optimization over probability simplices).

Given a fixed Bregman divergence and some initialization $\theta^1 \in \Theta$, the stochastic mirror descent algorithm generates a sequence of random iterates $\{\theta^t\}_{t=1}^\infty$ as follows. At iteration $t$, the algorithm maintains its current estimate $\theta^t$ and receives a vector $g_t \in \mathbb{R}^d$ that is an unbiased estimate of a subgradient of the risk function $R$ (i.e., $\mathbb{E}[g_t \mid \theta^t] \in \partial R(\theta^t)$). Using these quantities, it performs the update

$$\theta^{t+1} = \operatorname*{argmin}_{\theta \in \Theta} \big\{\eta \langle g_t, \theta \rangle + D_\psi(\theta, \theta^t)\big\}, \tag{37}$$

where $\eta$ is a stepsize that parameterizes the algorithm. As a special case, when the Bregman divergence is the Euclidean distance, the mirror descent update (37) is equivalent to the usual
projected subgradient algorithm. See the papers [1, IsO] for a detailed analysis of the convergence 
properties of these algorithms, as well as Appendix D, where we present our formal analysis.
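
To make the Euclidean special case concrete, here is a minimal sketch of one update, assuming (purely for illustration) that $\Theta$ is an $\ell_2$-ball of a given radius. With $\psi(\theta) = \frac{1}{2}\|\theta\|_2^2$, the minimizer of $\eta\langle g_t, \theta\rangle + \frac{1}{2}\|\theta - \theta^t\|_2^2$ over $\Theta$ is the projection of $\theta^t - \eta g_t$ onto $\Theta$. The function names are hypothetical.

```python
import math


def project_l2_ball(theta, radius):
    # Euclidean projection onto {theta : ||theta||_2 <= radius}.
    norm = math.sqrt(sum(c * c for c in theta))
    if norm <= radius:
        return list(theta)
    return [radius * c / norm for c in theta]


def mirror_descent_step(theta, g, eta, radius):
    # Euclidean mirror descent: take a gradient step, then project back onto the ball.
    stepped = [c - eta * gc for c, gc in zip(theta, g)]
    return project_l2_ball(stepped, radius)
```

Other domains only change the projection; other Bregman divergences change the update itself and are not covered by this sketch.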

The second ingredient of an implementable scheme is a conditional distribution $Q$ that satisfies $\alpha$-local differential privacy. We construct $Z$ by perturbing the random vector $g$ to construct an appropriate random vector $Z \in \mathbb{R}^d$ satisfying $\mathbb{E}[Z \mid g] = g$. Our proofs use one of two sampling strategies, each of which involves a scalar bound $B \in \mathbb{R}_+$ that we specify later. In addition, we define the bias probability $\pi_\alpha := e^{\alpha}/(e^{\alpha} + 1)$ and let $T$ be a Bernoulli($\pi_\alpha$) random variable.

Figure 1. Private sampling strategies. (a) Strategy (38a) for the $\ell_2$-ball. Outer boundary of highlighted region sampled uniformly with probability $e^{\alpha}/(e^{\alpha} + 1)$. (b) Strategy (38b) for the $\ell_\infty$-ball. Circled point set sampled uniformly with probability $e^{\alpha}/(e^{\alpha} + 1)$.



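Two methods for $\alpha$-private conditional sampling:

Strategy A: Given a vector $g$ with $\|g\|_2 \le L$, set $\tilde{g} = Lg/\|g\|_2$ with probability $\frac{1}{2} + \|g\|_2/2L$ and $\tilde{g} = -Lg/\|g\|_2$ with probability $\frac{1}{2} - \|g\|_2/2L$. Then sample $T$ and set

$$Z \sim \begin{cases} \text{Uniform}(z \in \mathbb{R}^d : \langle z, \tilde{g} \rangle > 0, \|z\|_2 = B) & \text{if } T = 1 \\ \text{Uniform}(z \in \mathbb{R}^d : \langle z, \tilde{g} \rangle \le 0, \|z\|_2 = B) & \text{if } T = 0. \end{cases} \tag{38a}$$

Strategy B: Given a vector $g$ with $\|g\|_\infty \le L$, construct $\tilde{g} \in \mathbb{R}^d$ with coordinates $\tilde{g}_j$ sampled independently from $\{-L, L\}$ with probabilities $1/2 - g_j/(2L)$ and $1/2 + g_j/(2L)$. Then sample $T$ and set

$$Z \sim \begin{cases} \text{Uniform}(z \in \{-B, B\}^d : \langle z, \tilde{g} \rangle > 0) & \text{if } T = 1 \\ \text{Uniform}(z \in \{-B, B\}^d : \langle z, \tilde{g} \rangle \le 0) & \text{if } T = 0. \end{cases} \tag{38b}$$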



Remark: By inspection of the sampling strategies (38a) and (38b), each is $\alpha$-differentially private for any vector satisfying $\|g\|_2 \le L$ or $\|g\|_\infty \le L$, respectively. Moreover, each sampling strategy can be implemented efficiently: the first by normalizing a random $N(0, I_{d \times d})$ sample, the second by rejection sampling over $\{-B, B\}^d$. See Figure 1 for visualizations of the sampling strategies.
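
The two strategies and the implementation hints in the remark can be sketched as follows. This is an illustrative reading of (38a) and (38b), not the authors' code: the function names are hypothetical, the zero-gradient case in Strategy A is handled by an arbitrary fixed direction (an assumption), and the functions return $\tilde{g}$ and $T$ alongside $Z$ only so that the construction can be inspected.

```python
import math
import random


def bernoulli_t(alpha, rng):
    # T = 1 with probability pi_alpha = e^alpha / (e^alpha + 1).
    return 1 if rng.random() < math.exp(alpha) / (math.exp(alpha) + 1.0) else 0


def strategy_a(g, L, B, alpha, rng):
    d = len(g)
    norm = math.sqrt(sum(c * c for c in g))
    # Assumption: for g = 0 any fixed unit direction works, since both signs are equally likely.
    unit = [c / norm for c in g] if norm > 0 else [1.0] + [0.0] * (d - 1)
    sign = 1.0 if rng.random() < 0.5 + norm / (2.0 * L) else -1.0
    g_tilde = [sign * L * c for c in unit]
    t = bernoulli_t(alpha, rng)
    # Uniform point on the sphere of radius B via a normalized Gaussian vector.
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z_norm = math.sqrt(sum(c * c for c in z))
    z = [B * c / z_norm for c in z]
    inner = sum(a * b for a, b in zip(z, g_tilde))
    # Reflect into the half-space selected by T (the boundary has probability zero).
    if (inner > 0) != (t == 1):
        z = [-c for c in z]
    return z, g_tilde, t


def strategy_b(g, L, B, alpha, rng):
    # Randomized rounding of each coordinate to {-L, L}.
    g_tilde = [L if rng.random() < 0.5 + c / (2.0 * L) else -L for c in g]
    t = bernoulli_t(alpha, rng)
    while True:
        # Rejection sampling over the hypercube corners {-B, B}^d.
        z = [B if rng.random() < 0.5 else -B for _ in g]
        inner = sum(a * b for a, b in zip(z, g_tilde))
        if (inner > 0) == (t == 1):
            return z, g_tilde, t
```

Unbiasedness, $\mathbb{E}[Z \mid g] = g$, additionally requires the specific choice of $B$ discussed later in the text; the sketch leaves $B$ as a free parameter.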

Our approach is to apply the sampling strategies (38a) and (38b), coupled with the mirror descent method (37), to develop $\alpha$-locally differentially private algorithms for convex risk minimization. In each case, our algorithm is as follows. At iteration $t$ of the algorithm, a stochastic gradient, $g_t \in \partial_\theta \ell(X_t, \theta^t)$, of the $t$th datum is computed, after which a vector $Z_t$ is sampled according to either the distribution (38a) or (38b) with the property that $\mathbb{E}[Z_t \mid g_t] = g_t$. We then apply mirror descent with these $\alpha$-differentially private stochastic gradient estimates $Z_t$.

In Appendix D, we show that the sampling strategy (38b), with appropriate choices of $B$, yields the upper bound in Proposition 3. Here we state a detailed convergence result that achieves the upper bound stated in Proposition 4.

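Proposition 5. Assume that $\Theta \subset \{\theta \in \mathbb{R}^d : \|\theta\|_2 \le r_2\}$, that $\ell$ is $L$-Lipschitz with respect to the $\ell_p$-norm for some $p \in [2, \infty]$, and $\alpha \le 1$. Let $Z_t$ be generated according to the sampling scheme (38a) starting from the stochastic gradient vector $g_t$ with

$$B = L\, \frac{e^{\alpha} + 1}{e^{\alpha} - 1}\, \frac{\sqrt{\pi}}{2}\, \frac{d\, \Gamma\big(\frac{d-1}{2} + 1\big)}{\Gamma\big(\frac{d}{2} + 1\big)}.$$

Then stochastic gradient descent (the update (37) with $\psi(\theta) = \frac{1}{2}\|\theta\|_2^2$) achieves convergence rate

$$\mathbb{E}[R(\widehat{\theta}_n)] - R(\theta^*) \le c\, \frac{r_2 L \sqrt{d}}{\alpha \sqrt{n}}.$$

See Appendix D.3 for the proof of this result.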

A few minor remarks are in order here. To get a sharp upper bound to match Proposition 4, we note that if the loss $\ell$ is $\mathcal{X}$-uniformly $(L, p)$-Lipschitz for $p \in [2, \infty]$, then for $g \in \partial_\theta \ell(x, \theta)$ and $q$ conjugate to $p$, i.e., $1/p + 1/q = 1$, we have $\|g\|_2 \le \|g\|_q \le L$. As a consequence, the sampling strategy (38a) applies naturally. Continuing, we note that if $\Theta \subset C[-r, r]^d$ for some (absolute) constant $C$, then the bound $\|\theta\|_2 \le \sqrt{d}\|\theta\|_\infty$ implies $\Theta \subset \{\theta : \|\theta\|_2 \le C\sqrt{d}\, r\}$. Consequently, Proposition 5 implies the upper bound

$$\epsilon_n^*(\mathcal{L}, \Theta, \alpha) \le c_u \frac{1}{\alpha} \frac{rLd}{\sqrt{n}},$$

which matches the bound (34b) in Proposition 4 precisely.



Additionally, it appears that the standard strategy 17|, ll5|] of adding Laplace noise is sub- 
optimal for these convex risk minimization problems. While we have not provided a formal lower 
bound in our minimax framework, to privatize vectors $g \in \mathbb{R}^d$ such that $\|g\|_2 \le 1$ by addition of independent Laplace noise, one must add vectors $W \in \mathbb{R}^d$ whose coordinates are distributed as Laplace($\alpha/\sqrt{d}$). In this case $\mathbb{E}[\|W\|_2^2] = d^2/\alpha^2$, which yields a convergence guarantee of $O(r_2 L d/\sqrt{n\alpha^2})$ under the conditions of Proposition 5; the noise is $O(d)$ too large. The more careful sampling strategies (38a) and (38b) avoid this additional dimension dependence.

6 Proof of Theorem 1 and related results

We now turn to the proofs of our results, beginning with Theorem 1 and related results. In all cases, we defer the proofs of more technical lemmas to the appendices.

6.1 Proof of Theorem 1

Observe that $M_1$ and $M_2$ are absolutely continuous with respect to one another, and there is a measure $\mu$ with respect to which they have densities $m_1$ and $m_2$, respectively. The channel probabilities $Q(\cdot \mid x)$ and $Q(\cdot \mid x')$ are likewise absolutely continuous, so we may assume they have densities $q(\cdot \mid x)$ and write $m_i(z) = \int q(z \mid x)\, dP_i(x)$. In terms of these densities, we can write

$$D_{\mathrm{kl}}(M_1 \| M_2) + D_{\mathrm{kl}}(M_2 \| M_1) = \int m_1(z) \log \frac{m_1(z)}{m_2(z)}\, d\mu(z) + \int m_2(z) \log \frac{m_2(z)}{m_1(z)}\, d\mu(z) = \int \big(m_1(z) - m_2(z)\big) \big(\log m_1(z) - \log m_2(z)\big)\, d\mu(z).$$

Consequently, we need to bound both the difference $m_1 - m_2$ and the difference of the logarithms. To this end, we state two lemmas:



Lemma 1. For any $\alpha$-locally differentially private conditional, we have

$$|m_1(z) - m_2(z)| \le 2 \inf_x q(z \mid x)\, (e^{\alpha} - 1)\, \|P_1 - P_2\|_{\mathrm{TV}}. \tag{39}$$

We provide the proof of this claim at the end of this section. The following elementary lemma, proved in Appendix E, is useful for controlling the log differences:

Lemma 2. Let $a, b, c \in \mathbb{R}$ with $\max\{|c|, |b|\} < a$. Then

$$\Big|\log \frac{a + b}{a + c}\Big| \le \frac{|b - c|}{a - \max\{|b|, |c|\}}.$$



We use Lemmas 1 and 2 to complete the proof of the theorem. We begin by making note of the elementary relation

$$m_1(z) = \int q(z \mid x)\, dP_1(x) = \frac{1}{2} \int q(z \mid x) \big(dP_1(x) + dP_2(x)\big) + \frac{1}{2} \int q(z \mid x) \big(dP_1(x) - dP_2(x)\big),$$

along with the analogous equality for $m_2$ with the roles of $P_1$ and $P_2$ reversed. Combining these two equalities, we find that the log ratio can be written as

$$\Big|\log \frac{m_1(z)}{m_2(z)}\Big| = \Bigg|\log \frac{\frac{1}{2} \int q(z \mid x)(dP_1(x) + dP_2(x)) + \frac{1}{2} \int q(z \mid x)(dP_1(x) - dP_2(x))}{\frac{1}{2} \int q(z \mid x)(dP_2(x) + dP_1(x)) + \frac{1}{2} \int q(z \mid x)(dP_2(x) - dP_1(x))}\Bigg|$$

$$\le \frac{\big|\int q(z \mid x)(dP_1(x) - dP_2(x))\big|}{\frac{1}{2} \int q(z \mid x)(dP_1(x) + dP_2(x)) - \frac{1}{2} \big|\int q(z \mid x)(dP_1(x) - dP_2(x))\big|} = \frac{|m_1(z) - m_2(z)|}{\frac{1}{2} \int q(z \mid x)(dP_1(x) + dP_2(x)) - \frac{1}{2} \big|\int q(z \mid x)(dP_1(x) - dP_2(x))\big|},$$

where the inequality follows from Lemma 2. Applying inequality (39) from Lemma 1 to bound the numerator, we find that

$$\Big|\log \frac{m_1(z)}{m_2(z)}\Big| \le \frac{2(e^{\alpha} - 1)\, \|P_1 - P_2\|_{\mathrm{TV}}\, \inf_x q(z \mid x)}{\frac{1}{2} \int q(z \mid x)(dP_1(x) + dP_2(x)) - \frac{1}{2} \big|\int q(z \mid x)(dP_1(x) - dP_2(x))\big|}.$$

Noting that

$$\frac{1}{2} \int q(z \mid x)\big(dP_1(x) + dP_2(x)\big) - \frac{1}{2} \Big|\int q(z \mid x)\big(dP_1(x) - dP_2(x)\big)\Big| \ge \min\{m_1(z), m_2(z)\} \ge \inf_x q(z \mid x),$$

we obtain the bound

$$\Big|\log \frac{m_1(z)}{m_2(z)}\Big| \le 2(e^{\alpha} - 1)\, \|P_1 - P_2\|_{\mathrm{TV}}.$$

Combining this with our inequality (39) yields

$$D_{\mathrm{kl}}(M_1 \| M_2) + D_{\mathrm{kl}}(M_2 \| M_1) \le 4(e^{\alpha} - 1)^2\, \|P_1 - P_2\|_{\mathrm{TV}}^2 \int \inf_x q(z \mid x)\, d\mu(z).$$

The final integral is at most 1, which completes the proof of the main theorem.
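
As a sanity check on the inequality just proved, the sketch below evaluates both sides for a simple channel that is not taken from the text: binary randomized response, which reports a bit truthfully with probability $e^{\alpha}/(1 + e^{\alpha})$ and is $\alpha$-locally differentially private. The inputs are Bernoulli($p_1$) and Bernoulli($p_2$), whose total variation distance is $|p_1 - p_2|$. All function names are hypothetical.

```python
import math


def randomized_response_marginal(p, alpha):
    # P(Z = 1) when X ~ Bernoulli(p) and Z equals X with probability e^alpha / (1 + e^alpha).
    keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return keep * p + (1.0 - keep) * (1.0 - p)


def symmetrized_kl(m1, m2):
    # D_kl(M1 || M2) + D_kl(M2 || M1) for Bernoulli marginals with P(Z = 1) = m1, m2.
    return (m1 - m2) * math.log(m1 / m2) + (m2 - m1) * math.log((1.0 - m1) / (1.0 - m2))


def theorem1_bound(p1, p2, alpha):
    # Right-hand side 4 (e^alpha - 1)^2 ||P1 - P2||_TV^2 for Bernoulli inputs.
    tv = abs(p1 - p2)
    return 4.0 * (math.exp(alpha) - 1.0) ** 2 * tv ** 2
```

Comparing `symmetrized_kl` of the two marginals against `theorem1_bound` over a grid of $(p_1, p_2, \alpha)$ values gives a quick empirical confirmation of the bound for this particular channel.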
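It remains to prove Lemma 1. For any $z \in \mathcal{Z}$, we have

$$m_1(z) - m_2(z) = \int_{\mathcal{X}} q(z \mid x)\, [dP_1(x) - dP_2(x)]$$

$$= \int_{\mathcal{X}} q(z \mid x)\, [dP_1(x) - dP_2(x)]_+ + \int_{\mathcal{X}} q(z \mid x)\, [dP_1(x) - dP_2(x)]_-$$

$$\le \sup_{x \in \mathcal{X}} q(z \mid x) \int_{\mathcal{X}} [dP_1(x) - dP_2(x)]_+ + \inf_{x \in \mathcal{X}} q(z \mid x) \int_{\mathcal{X}} [dP_1(x) - dP_2(x)]_-$$

$$= \Big(\sup_{x \in \mathcal{X}} q(z \mid x) - \inf_{x \in \mathcal{X}} q(z \mid x)\Big) \int_{\mathcal{X}} [dP_1(x) - dP_2(x)]_+.$$

By definition of the total variation norm, we have $\int [dP_1 - dP_2]_+ = \|P_1 - P_2\|_{\mathrm{TV}}$, hence

$$|m_1(z) - m_2(z)| \le \sup_{x, x'} |q(z \mid x) - q(z \mid x')|\; \|P_1 - P_2\|_{\mathrm{TV}}. \tag{40}$$

For any $\hat{x} \in \mathcal{X}$, we may add and subtract $q(z \mid \hat{x})$ from the quantity inside the supremum, which implies that

$$\sup_{x, x'} |q(z \mid x) - q(z \mid x')| = \inf_{\hat{x}} \sup_{x, x'} |q(z \mid x) - q(z \mid \hat{x}) + q(z \mid \hat{x}) - q(z \mid x')|$$

$$\le 2 \inf_{\hat{x}} \sup_{x} |q(z \mid x) - q(z \mid \hat{x})| = 2 \inf_{\hat{x}} q(z \mid \hat{x}) \sup_{x} \Big|\frac{q(z \mid x)}{q(z \mid \hat{x})} - 1\Big|.$$

Since for any choice of $x, \hat{x}$, we have $q(z \mid x)/q(z \mid \hat{x}) \in [e^{-\alpha}, e^{\alpha}]$, we find that (since $e^{\alpha} - 1 \ge 1 - e^{-\alpha}$)

$$\sup_{x, x'} |q(z \mid x) - q(z \mid x')| \le 2 \inf_{\hat{x}} q(z \mid \hat{x})\, (e^{\alpha} - 1).$$

Combining with the earlier inequality (40) yields the claim (39).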
6.2 Proof of Corollary 1

Recall that $M_\nu^n$ denotes the induced marginal distribution (3), which is defined for $A \in \sigma(\mathcal{Z}^n)$ by $M_\nu^n(A) = \int Q(A \mid x_{1:n})\, dP_\nu^n(x_{1:n})$. For each $i = 2, \ldots, n$, we let

$$M_{\nu,i}(\cdot \mid Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1}) = M_{\nu,i}(\cdot \mid Z_{1:i-1} = z_{1:i-1})$$

denote the (marginal over $X_i$) distribution of the variable $Z_i$ conditioned on $Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1}$. In addition, use the shorthand notation

$$D_{\mathrm{kl}}(M_{\nu,i} \| M_{\nu',i}) := \int D_{\mathrm{kl}}\big(M_{\nu,i}(\cdot \mid z_{1:i-1}) \,\|\, M_{\nu',i}(\cdot \mid z_{1:i-1})\big)\, dM_\nu^{i-1}(z_1, \ldots, z_{i-1})$$

to denote the integrated KL divergence of the conditional distributions on the $Z_i$. By the chain rule for KL divergences (Gray, Chapter 5.3), we obtain

$$D_{\mathrm{kl}}(M_\nu^n \| M_{\nu'}^n) = \sum_{i=1}^n D_{\mathrm{kl}}(M_{\nu,i} \| M_{\nu',i}).$$

By assumption on the channel $Q$, we know that the distribution $Q_i(\cdot \mid X_i, Z_{1:i-1})$ on $Z_i$ is $\alpha$-differentially private for the sample $X_i$. As a consequence, if we let $P_{\nu,i}(\cdot \mid Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1})$ denote the conditional distribution of $X_i$ given the first $i - 1$ values $Z_1, \ldots, Z_{i-1}$ and the packing index $V = \nu$, then from Theorem 1 and the chain rule we obtain

$$D_{\mathrm{kl}}(M_\nu^n \| M_{\nu'}^n) + D_{\mathrm{kl}}(M_{\nu'}^n \| M_\nu^n) \le \sum_{i=1}^n 4(e^{\alpha} - 1)^2 \int \big\|P_{\nu,i}(\cdot \mid z_{1:i-1}) - P_{\nu',i}(\cdot \mid z_{1:i-1})\big\|_{\mathrm{TV}}^2\, dM_\nu^{i-1}(z_1, \ldots, z_{i-1}).$$

By the construction of our sampling scheme, the random variables $X_i$ are conditionally independent given $V = \nu$; thus the distribution $P_{\nu,i}(\cdot \mid z_{1:i-1}) = P_{\nu,i}$, where $P_{\nu,i}$ denotes the distribution of $X_i$ conditioned on $V = \nu$. This implies the equality $\|P_{\nu,i}(\cdot \mid z_{1:i-1}) - P_{\nu',i}(\cdot \mid z_{1:i-1})\|_{\mathrm{TV}} = \|P_{\nu,i} - P_{\nu',i}\|_{\mathrm{TV}}$, which yields the desired result.

6.3 Proof of Proposition 1

The minimax rate characterized by equation (22) involves both a lower and an upper bound, and we divide our proof accordingly.

Lower bound: We use Le Cam's method to prove the lower bound in equation (22). First, fix a constant $C > 0$, whose value we specify later, and a constant $\delta > 0$, whose value we will also specify. We construct a $2\delta C^{(1-k)/k}$-separated set of two points that we must distinguish. Let $\mathcal{V} = \{-1, 1\}$, and for $\nu \in \mathcal{V}$ define $\theta_\nu = \nu \delta C^{(1-k)/k}$, and define the distribution $P_\nu$ supported on $\{-C^{1/k}, 0, C^{1/k}\}$ by

Then by inspection, we have 

E^[X] = (Ji/C^A-i g^j^^ E4|X|'^]=5C. (41) 

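We will later choose $\delta$ and $C$ such that both expectation values lie in $[-1, 1]$. Now, we see that $\theta_1 - \theta_{-1} = 2\delta C^{(1-k)/k}$, whence an application of Le Cam's method (8) and minimax bound (7) yields

$$\mathfrak{M}_n(\Theta, (\cdot)^2, \alpha) \ge \big(\delta C^{(1-k)/k}\big)^2 \Big(\frac{1}{2} - \frac{1}{2} \|M_1^n - M_{-1}^n\|_{\mathrm{TV}}\Big),$$

where $M_\nu^n$ denotes the marginal distribution of the samples $Z_1, \ldots, Z_n$ conditioned on $\theta = \theta_\nu$. Now we claim that for any $\alpha$-locally differentially private channel $Q$,

$$\|M_1^n - M_{-1}^n\|_{\mathrm{TV}}^2 \le (e^{\alpha} - 1)^2\, n\, \frac{\delta^2}{C^2}. \tag{42}$$

Indeed, Pinsker's inequality implies $\|M_1^n - M_{-1}^n\|_{\mathrm{TV}}^2 \le \frac{1}{2} \min\{D_{\mathrm{kl}}(M_1^n \| M_{-1}^n), D_{\mathrm{kl}}(M_{-1}^n \| M_1^n)\}$, and Corollary 1 yields

$$\min\{D_{\mathrm{kl}}(M_1^n \| M_{-1}^n), D_{\mathrm{kl}}(M_{-1}^n \| M_1^n)\} \le 2(e^{\alpha} - 1)^2\, n\, \|P_1 - P_{-1}\|_{\mathrm{TV}}^2.$$

Since, by construction, we have $\|P_1 - P_{-1}\|_{\mathrm{TV}} = \delta/C$, we obtain the inequality (42). If $\alpha \le 1$, we have $e^{\alpha} - 1 \le 2\alpha$, and thus our earlier application of Le Cam's method implies

$$\mathfrak{M}_n(\Theta, (\cdot)^2, \alpha) \ge \big(\delta C^{(1-k)/k}\big)^2 \Big(\frac{1}{2} - \frac{\alpha \delta \sqrt{n}}{C}\Big).$$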



Let us assume that $n\alpha^2 \ge 1/16$. By choosing $\delta = C/(4\sqrt{n\alpha^2})$, we find that $1/2 - \alpha\delta\sqrt{n}/C \ge 1/4$, and thus

Recalling the construction of the distribution on X and our equalities (j4ip . we must have S/C < 1, 
6C < 1, and ^C^/'^-i < 1. By our choice of 6, this requires C^/'^ < aVtw^ and < aVtw^. Since 
we assume na'^ > 1/16, we may take C = 2v^nc? and have C"^/^ < = 4\/ na'^. In this case, we 
obtain 

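On the other hand, when $n\alpha^2 < 1/16$, we take $\delta = C = 1$, which gives $\delta/C = \delta C = \delta C^{1/k - 1} = 1$, and moreover we have

$$\mathfrak{M}_n(\Theta, (\cdot)^2, \alpha) \ge (1)^2 \Big(\frac{1}{2} - \sqrt{n}\,\alpha\Big) \ge \frac{1}{4}. \tag{43b}$$

The combination of the bounds (43a) and (43b) yields the lower bound (22).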



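Upper bound: We must demonstrate an $\alpha$-locally private conditional distribution $Q$ and an estimator that achieves the upper bound in equation (22). We do so via a combination of truncation and addition of Laplacian noise. Define the truncation function $[\cdot]_T : \mathbb{R} \to [-T, T]$ by

$$[x]_T := \max\{-T, \min\{x, T\}\},$$

where the truncation level $T$ is to be chosen. Let $W_i$ be independent Laplace($\alpha/(2T)$) random variables, and for each index $i = 1, \ldots, n$, define $Z_i := [X_i]_T + W_i$. By construction, the random variable $Z_i$ is $\alpha$-differentially private for $X_i$. For the mean estimator $\widehat{\theta} := \frac{1}{n} \sum_{i=1}^n Z_i$, we have

$$\mathbb{E}\big[(\widehat{\theta} - \theta)^2\big] = \operatorname{Var}(\widehat{\theta}) + \big(\mathbb{E}[\widehat{\theta}] - \theta\big)^2 = \frac{4T^2}{n\alpha^2} + \frac{1}{n} \operatorname{Var}([X_1]_T) + \big(\mathbb{E}[Z_1] - \theta\big)^2. \tag{44}$$

We claim that

$$\mathbb{E}[Z] = \mathbb{E}\big[[X]_T\big] \in \Big[\mathbb{E}[X] - \frac{1}{(k-1)T^{k-1}},\; \mathbb{E}[X] + \frac{1}{(k-1)T^{k-1}}\Big]. \tag{45}$$

Indeed, by the assumption that $\mathbb{E}[|X|^k] \le 1$, we have by a change of variables that

$$\int_T^\infty (x - T)\, dP(x) = \int_T^\infty P(X \ge x)\, dx \le \int_T^\infty \frac{1}{x^k}\, dx = \frac{1}{(k-1)T^{k-1}}.$$

Thus

$$\mathbb{E}\big[[X]_T\big] \ge \mathbb{E}[\min\{X, T\}] = \mathbb{E}\big[\min\{X, T\} + [X - T]_+ - [X - T]_+\big] = \mathbb{E}[X] - \int_T^\infty (x - T)\, dP(x) \ge \mathbb{E}[X] - \frac{1}{(k-1)T^{k-1}}.$$

A similar inequality holds for the upper bound (45).

As a consequence, we use the bound and note that since $[X]_T \in [-T, T]$ and $\alpha \le 1$,

$$\mathbb{E}\big[(\widehat{\theta} - \theta)^2\big] \le \frac{5T^2}{n\alpha^2} + \frac{1}{(k-1)^2 T^{2k-2}},$$

which holds for any choice of $T > 0$. Thus we may choose $T$ to minimize the above bound, and taking $T = (5(k-1))^{-\frac{1}{2k}} (n\alpha^2)^{1/(2k)}$ gives

$$\mathbb{E}\big[(\widehat{\theta} - \theta)^2\big] \le \frac{5(5(k-1))^{-\frac{1}{k}} (n\alpha^2)^{\frac{1}{k}}}{n\alpha^2} + \frac{1}{(k-1)^2 (5(k-1))^{-1 + 1/k} (n\alpha^2)^{1 - 1/k}} = 5^{1 - \frac{1}{k}} \Big(1 + \frac{1}{k-1}\Big) \frac{1}{(k-1)^{\frac{1}{k}} (n\alpha^2)^{1 - \frac{1}{k}}}.$$

Since $(1 + (k-1)^{-1})(k-1)^{-\frac{1}{k}} < (k-1)^{-1} + (k-1)^{-2}$ for $k \in (1, 2)$ and is bounded by $1 + (k-1)^{-1} \le 2$ for $k \in [2, \infty]$, we obtain the upper bound (22).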
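
The estimator analyzed in this upper bound can be sketched directly. The snippet is illustrative only: it assumes that "Laplace($\alpha/(2T)$)" is a rate parameterization (noise scale $2T/\alpha$), it uses the truncation level $T$ chosen in the proof, and the function name `private_mean` is hypothetical.

```python
import random


def private_mean(xs, alpha, k, rng):
    # Truncate each sample to [-T, T], add Laplace noise, and average the releases.
    n = len(xs)
    # Truncation level from the proof: T = (5(k-1))^{-1/(2k)} (n alpha^2)^{1/(2k)}.
    T = (5.0 * (k - 1.0)) ** (-1.0 / (2.0 * k)) * (n * alpha ** 2) ** (1.0 / (2.0 * k))
    scale = 2.0 * T / alpha  # assumed Laplace scale for a value confined to [-T, T]
    total = 0.0
    for x in xs:
        clipped = max(-T, min(x, T))
        # The difference of two independent exponentials is a centered Laplace variable.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        total += clipped + noise
    return total / n
```

The sketch requires $k > 1$, matching the moment condition $\mathbb{E}[|X|^k] \le 1$ used in the analysis; larger $k$ permits a smaller truncation level and hence less added noise.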

6.4 Proof of Proposition 2

Lower bound: We use a slight generalization of the $\alpha$-private form (20) of the local Fano inequality previously derived. For concreteness, we assume throughout that $\alpha \in [0, \frac{23}{35}]$, but analogous arguments hold for any bounded $\alpha$ with changes only in the constant pre-factors. We consider an instance of the linear regression model (23) in which the noise variables $\{\varepsilon_i\}_{i=1}^n$ are drawn i.i.d. from the uniform distribution on $[-\sigma, +\sigma]$. Our first step is to construct a suitable local packing of the unit sphere $S^{d-1} = \{u \in \mathbb{R}^d : \|u\|_2 = 1\}$ in $\ell_2$-norm. (See Appendix B.1 for a proof.)

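Lemma 3. There exists a $1/2$ packing $\mathcal{V} = \{v^1, \ldots, v^N\}$ of the unit sphere $S^{d-1}$ such that

$$N \ge \begin{cases} \exp(49d/256) & \text{for } d > 16 \\ \exp(d \log(2)) & \text{for } d \in [1, 16], \end{cases} \quad \text{and} \quad \frac{1}{N} \sum_{j=1}^N v^j (v^j)^\top \preceq \begin{cases} \frac{2}{d} I_{d \times d} & \text{for } d > 16 \\ \frac{1}{d} I_{d \times d} & \text{for } d \in [1, 16]. \end{cases}$$

For a fixed $\delta \in (0, 1]$ to be chosen shortly, define the family of vectors $\{\theta_\nu, \nu \in \mathcal{V}\}$ with $\theta_\nu := \delta \nu$. Since $\|\nu\|_2 \le 1$, we have $\|\theta_\nu - \theta_{\nu'}\|_2 \le 2\delta$. Let $P_{\nu,i}$ denote the distribution of $Y_i$ conditioned on $\theta^* = \theta_\nu$. By the form of the linear regression model (23) and our assumption on the noise variable $\varepsilon_i$, $P_{\nu,i}$ is uniform on the interval $[\langle \theta_\nu, x_i \rangle - \sigma, \langle \theta_\nu, x_i \rangle + \sigma]$. Consequently, for $\nu \ne \nu' \in \mathcal{V}$, we have

$$\|P_{\nu,i} - P_{\nu',i}\|_{\mathrm{TV}} \le \frac{1}{2} \int |p_{\nu,i}(y) - p_{\nu',i}(y)|\, dy \le \frac{1}{2} \Big[\frac{1}{2\sigma} |\langle \theta_\nu, x_i \rangle - \langle \theta_{\nu'}, x_i \rangle| + \frac{1}{2\sigma} |\langle \theta_\nu, x_i \rangle - \langle \theta_{\nu'}, x_i \rangle|\Big] = \frac{1}{2\sigma} |\langle \theta_\nu - \theta_{\nu'}, x_i \rangle|.$$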



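Letting $V$ denote a random sample from the uniform distribution on $\mathcal{V}$, Corollary 1 implies

$$I(Z_1, \ldots, Z_n; V) \le 2(e^{\alpha} - 1)^2 \sum_{i=1}^n \frac{1}{|\mathcal{V}|^2} \sum_{\nu, \nu' \in \mathcal{V}} \|P_{\nu,i} - P_{\nu',i}\|_{\mathrm{TV}}^2 \le \frac{(e^{\alpha} - 1)^2}{2\sigma^2} \frac{1}{|\mathcal{V}|^2} \sum_{\nu, \nu' \in \mathcal{V}} \sum_{i=1}^n \langle \theta_\nu - \theta_{\nu'}, x_i \rangle^2.$$

Substituting $\theta_\nu = \delta \nu$ yields

$$I(Z_1, \ldots, Z_n; V) \le \frac{\delta^2 (e^{\alpha} - 1)^2}{2\sigma^2} \frac{1}{|\mathcal{V}|^2} \sum_{\nu, \nu' \in \mathcal{V}} (\nu - \nu')^\top X^\top X (\nu - \nu') = \frac{\delta^2 (e^{\alpha} - 1)^2}{\sigma^2} \operatorname{tr}\big(X^\top X \operatorname{Cov}(V)\big),$$

where $\operatorname{Cov}(V)$ is the covariance of the vector $V$. Since $\operatorname{Cov}(V) \preceq \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \nu \nu^\top$, Lemma 3 guarantees that $\operatorname{tr}(X^\top X \operatorname{Cov}(V)) \le \frac{2}{d} \operatorname{tr}(X^\top X)$, and hence

$$I(Z_1, \ldots, Z_n; V) \le \frac{2\delta^2 (e^{\alpha} - 1)^2 \operatorname{tr}(X^\top X)}{d\sigma^2} \le \frac{4\delta^2 \alpha^2 \operatorname{tr}(X^\top X)}{d\sigma^2},$$

where the second inequality is valid for $\alpha \in [0, \frac{23}{35}]$. Consequently, Fano's inequality implies that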



n {9{V, 



k)i Ir II2 1 ' 



a > - 1 



45^a^tr(X'X)/dcj^ +log2 
49(i/256 



(46) 



We split the remainder of the argument into two cases: 

Case 1: First, suppose that da/{AayJii{X^X)) < 1. Choosing 5 = da/{6a^/tr{X^ X)) then yields 



256 45^anr(X'X)/(iCT^ + log2 256 



49 d 49 

so long as d > 17. As a consequence, we have the lower bound 



log 2 ^ 1 



d 9 



4 

<5 



4-62 a'^tr{X'^X) 5' 
which is the desired lower bound. (When $d \le 16$, we again apply Fano's inequality to obtain the same result, but we have a packing of size at least $\exp(d \log 2)$.)



Case 2: In the second case, we assume that da / {8a^y tT{X~^ X)) > 1. Choosing 5=1 then yields 
the bound 



(e.i 



12 



a > 



1 



^ _ d/8 + log2 \ J_ 
49d/256 y 32 



whenever d > 17. Again, for d < 16 we obtain the same result, but the packing size is exp((ilog(2)). 



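Upper bound: We now turn to the upper bound, for which we need to specify a private conditional
$Q$ and an estimator $\widehat\theta$ that achieves the stated upper bound on the mean-squared error.
Let $W_i$ be independent Laplace$(\alpha/(2\sigma))$ random variables. Then the additively perturbed
random variable $Z_i = Y_i + W_i$ is $\alpha$-differentially private for $Y_i$, since by assumption the response
$Y_i \in [\langle\theta, x_i\rangle - \sigma, \langle\theta, x_i\rangle + \sigma]$. We now claim that the standard least-squares estimator of $\theta^*$,
computed from the privatized responses $Z$, achieves the stated upper bound. Indeed, the least-squares estimator is given by

$$
\widehat\theta = (X^\top X)^{-1}X^\top Z = (X^\top X)^{-1}X^\top(X\theta^* + \varepsilon + W).
$$

Since $W$ and $\varepsilon$ are independent, we have

$$
\mathbb{E}\left[\|\widehat\theta - \theta^*\|_2^2\right]
= \mathbb{E}\left[\|(X^\top X)^{-1}X^\top(\varepsilon + W)\|_2^2\right]
= \mathbb{E}\left[\|(X^\top X)^{-1}X^\top\varepsilon\|_2^2\right] + \mathbb{E}\left[\|(X^\top X)^{-1}X^\top W\|_2^2\right].
$$

Since $\varepsilon \in [-\sigma, \sigma]^n$, we know that $\mathbb{E}[\varepsilon\varepsilon^\top] \preceq \sigma^2 I_{n\times n}$, and similarly $\mathbb{E}[WW^\top] = (4\sigma^2/\alpha^2) I_{n\times n}$.
Since $\alpha \le 1$, we thus find

$$
\mathbb{E}\left[\|\widehat\theta - \theta^*\|_2^2\right]
\le \frac{5\sigma^2}{\alpha^2}\,\mathrm{tr}\left(X(X^\top X)^{-2}X^\top\right)
= \frac{5\sigma^2}{\alpha^2}\,\mathrm{tr}\left((X^\top X)^{-1}\right),
$$

which corresponds to the claimed upper bound with $c_u = 5$.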
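
The privacy claim for the additive Laplace perturbation can be checked numerically. The sketch below is illustrative only and rests on an assumption: it reads Laplace$(\alpha/(2\sigma))$ as a rate parameter, i.e. a Laplace density with scale $2\sigma/\alpha$, which the text does not state explicitly. The function names are hypothetical. It evaluates the log-likelihood ratio between the two extreme responses, which are $2\sigma$ apart, and compares it with $\alpha$.

```python
import math
import random

def laplace_log_density(w, scale):
    """Log-density of a zero-mean Laplace distribution with the given scale."""
    return -abs(w) / scale - math.log(2.0 * scale)

def max_log_ratio(y, y_prime, scale, grid):
    """Largest value of log p(z | y) - log p(z | y') over a grid of outputs z."""
    return max(
        laplace_log_density(z - y, scale) - laplace_log_density(z - y_prime, scale)
        for z in grid
    )

def privatize(y, sigma, alpha, rng):
    """Release z = y + W, with W Laplace of scale 2*sigma/alpha (assumed parameterization)."""
    scale = 2.0 * sigma / alpha
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    w = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return y + w

sigma, alpha = 1.0, 0.5
scale = 2.0 * sigma / alpha
center = 0.3  # stands in for the inner product <theta, x_i>
grid = [center + 0.01 * k for k in range(-2000, 2001)]
# The responses lie in [center - sigma, center + sigma]; take the two endpoints.
worst = max_log_ratio(center - sigma, center + sigma, scale, grid)
```

Under this parameterization the worst-case log-ratio equals $\alpha$, attained for outputs outside the response interval, which is the $\alpha$-differential privacy guarantee.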



7 Proof of Theorem 2 and related results

In this section, we collect together the proof of Theorem 2 and its related corollaries. We defer the
proofs related to convex risk minimization to Section 8.



7.1 Proof of Theorem 2

Let $\mathcal{Z}$ denote the domain of the random variable $Z$. We begin by reducing the problem to the
case when $\mathcal{Z} = \{1, 2, \ldots, k\}$ for an arbitrary positive integer $k$. Indeed, in the general setting, we
let $\mathcal{K} = \{K_i\}_{i=1}^k$ be any (measurable) finite partition of $\mathcal{Z}$, where for $z \in \mathcal{Z}$ we let $[z]_{\mathcal K} = K_i$ for
the $K_i$ such that $z \in K_i$. The KL divergence $D_{\rm kl}(M_\nu\|\overline{M})$ can be defined as the supremum of the
(discrete) KL divergences between the random variables $[Z]_{\mathcal K}$ sampled according to $M_\nu$ and $\overline{M}$ over
all partitions $\mathcal{K}$ of $\mathcal{Z}$; for instance, see Gray [23, Chapter 5]. Consequently, we can prove the claim
for $\mathcal{Z} = \{1, 2, \ldots, k\}$, and then take the supremum over $k$ to recover the general case. Accordingly,
we can work with the probability mass functions $m(z \mid \nu) = M_\nu(Z = z)$ and $\overline{m}(z) = \overline{M}(Z = z)$,
and we may write

$$
D_{\rm kl}(M_\nu\|\overline{M}) + D_{\rm kl}(\overline{M}\|M_\nu)
= \sum_{z=1}^k \left(m(z\mid\nu) - \overline{m}(z)\right)\log\frac{m(z\mid\nu)}{\overline{m}(z)}.
\qquad (47)
$$

Throughout, we will also use (without loss of generality) the probability mass functions $q(z \mid x) =
Q(Z = z \mid X = x)$, where we note that $m(z \mid \nu) = \int q(z \mid x)\,dP_\nu(x)$.
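
As a quick sanity check of the identity for the symmetrized KL divergence between two probability mass functions, the following sketch compares both sides on a small example. The two vectors are arbitrary illustrative values, not quantities from the paper, and the function names are hypothetical.

```python
import math

def kl(p, q):
    """Discrete KL divergence D_kl(p || q) for strictly positive p.m.f.s."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def symmetrized_kl(p, q):
    """Sum over z of (p(z) - q(z)) * log(p(z) / q(z))."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

m_nu = [0.5, 0.3, 0.2]    # stands in for m(z | nu)
m_bar = [0.4, 0.4, 0.2]   # stands in for the mixture p.m.f.
lhs = kl(m_nu, m_bar) + kl(m_bar, m_nu)
rhs = symmetrized_kl(m_nu, m_bar)
```

Each summand on the right is nonnegative, which is why the proof can pass to absolute values term by term.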
Next we state a useful lemma: 

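Lemma 4. Let $q(\cdot \mid x)$ be an $\alpha$-differentially private p.m.f. defined for all $x \in \mathcal{X}$. There exists a
probability mass function $m^0$ on $\mathcal{Z} = \{1, 2, \ldots, k\}$ such that

$$
e^{-\alpha}m^0(z) \le q(z \mid x) \le e^{\alpha}m^0(z) \quad \text{for } z \in \mathcal{Z} \text{ and } x \in \mathcal{X}.
\qquad (48)
$$

For each $\nu \in \mathcal{V}$,

$$
|m(z \mid \nu) - \overline{m}(z)| \le 2(e^\alpha - 1)\,\|P_\nu - \overline{P}\|_{\rm TV}\,m^0(z) \le 2(e^\alpha - 1)\,m^0(z).
\qquad (49)
$$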



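For the moment, we take the result of Lemma 4 as given, and use it as well as Lemma 2 from
the proof of Theorem 1 to complete the proof of Theorem 2. (We return to prove Lemma 4 at
the end of this section.) Starting with equality (47), we have

$$
\begin{aligned}
\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\left[D_{\rm kl}(M_\nu\|\overline{M}) + D_{\rm kl}(\overline{M}\|M_\nu)\right]
&\le \frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\sum_{z=1}^k |m(z\mid\nu) - \overline{m}(z)|\,\left|\log\frac{m(z\mid\nu)}{\overline{m}(z)}\right| \\
&= \frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\sum_{z=1}^k |m(z\mid\nu) - \overline{m}(z)|\,\left|\log\frac{\overline{m}(z) + (m(z\mid\nu) - \overline{m}(z))}{\overline{m}(z)}\right| \\
&\le \frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\sum_{z=1}^k |m(z\mid\nu) - \overline{m}(z)|\,\frac{|m(z\mid\nu) - \overline{m}(z)|}{\overline{m}(z) - |m(z\mid\nu) - \overline{m}(z)|}.
\end{aligned}
$$

Applying the inequality (49) and the fact that $\overline{m}(z) \ge e^{-\alpha}m^0(z)$, we derive the further upper
bound (recall our choice of $\alpha < \log(\frac{1}{2} + \frac{1}{2}\sqrt{3})$, which guarantees that $e^{-\alpha} - 2(e^\alpha - 1) > 0$)

$$
\begin{aligned}
\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\left[D_{\rm kl}(M_\nu\|\overline{M}) + D_{\rm kl}(\overline{M}\|M_\nu)\right]
&\le \frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\sum_{z=1}^k \frac{(m(z\mid\nu) - \overline{m}(z))^2}{e^{-\alpha}m^0(z) - 2(e^\alpha - 1)m^0(z)} \\
&= \frac{1}{e^{-\alpha} - 2(e^\alpha - 1)}\cdot\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\sum_{z=1}^k \frac{(m(z\mid\nu) - \overline{m}(z))^2}{m^0(z)}.
\end{aligned}
$$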

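It remains to bound the final sum. For any constant $c \in \mathbb{R}$, we have

$$
m(z \mid \nu) - \overline{m}(z) = \int_{\mathcal X}\left(q(z \mid x) - c\right)\left(dP_\nu(x) - d\overline{P}(x)\right).
$$

We define a set of functions $f : \mathcal{Z}\times\mathcal{X} \to \mathbb{R}$ (depending implicitly on $m^0$) by

$$
\mathcal{F}_\alpha := \left\{f \mid f(z, x) \in [e^{-\alpha}, e^{\alpha}]\,m^0(z) \text{ for all } z \in \mathcal{Z} \text{ and } x \in \mathcal{X}\right\}.
$$

By Lemma 4, when viewed as a joint mapping from $\mathcal{Z}\times\mathcal{X} \to \mathbb{R}$, the conditional p.m.f. $q$ satisfies
$\{(z, x) \mapsto q(z \mid x)\} \in \mathcal{F}_\alpha$. Since constant (with respect to $x$) shifts do not change the above integral,
we can modify the range of functions in $\mathcal{F}_\alpha$ by subtracting the midpoint $m^0(z)(e^\alpha + e^{-\alpha})/2$ from each, yielding
the set

$$
\mathcal{F}'_\alpha := \left\{f \mid f(z, x) \in [e^{-\alpha} - e^{\alpha},\, e^{\alpha} - e^{-\alpha}]\,m^0(z)/2 \text{ for all } z \in \mathcal{Z} \text{ and } x \in \mathcal{X}\right\}.
$$

As a consequence, we find that

$$
\begin{aligned}
\sum_{\nu\in\mathcal V}\left(m(z\mid\nu) - \overline{m}(z)\right)^2
&\le \sup_{f\in\mathcal{F}_\alpha}\left\{\sum_{\nu\in\mathcal V}\left(\int_{\mathcal X} f(z, x)\left(dP_\nu(x) - d\overline{P}(x)\right)\right)^2\right\} \\
&= \sup_{f\in\mathcal{F}'_\alpha}\left\{\sum_{\nu\in\mathcal V}\left(\int_{\mathcal X} f(z, x)\left(dP_\nu(x) - d\overline{P}(x)\right)\right)^2\right\}.
\end{aligned}
$$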

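By inspection, when we divide by $m^0(z)$ and recall the definition of the set $\mathcal{G}_\alpha \subset L^\infty(\mathcal{X})$ in the
statement of Theorem 2, we obtain

$$
\sum_{\nu\in\mathcal V}\left(m(z\mid\nu) - \overline{m}(z)\right)^2
\le \left(m^0(z)\right)^2\sup_{\gamma\in\mathcal{G}_\alpha}\sum_{\nu\in\mathcal V}\left(\int_{\mathcal X}\gamma(x)\left(dP_\nu(x) - d\overline{P}(x)\right)\right)^2.
$$

Putting together our bounds, and using $\sum_z m^0(z) = 1$, we have

$$
\begin{aligned}
\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\left[D_{\rm kl}(M_\nu\|\overline{M}) + D_{\rm kl}(\overline{M}\|M_\nu)\right]
&\le \frac{1}{e^{-\alpha} - 2(e^\alpha - 1)}\cdot\frac{1}{|\mathcal V|}\sum_{z=1}^k\frac{(m^0(z))^2}{m^0(z)}
 \sup_{\gamma\in\mathcal{G}_\alpha}\sum_{\nu\in\mathcal V}\left(\int_{\mathcal X}\gamma(x)\left(dP_\nu(x) - d\overline{P}(x)\right)\right)^2 \\
&= \frac{1}{e^{-\alpha} - 2(e^\alpha - 1)}\cdot\frac{1}{|\mathcal V|}
 \sup_{\gamma\in\mathcal{G}_\alpha}\sum_{\nu\in\mathcal V}\left(\int_{\mathcal X}\gamma(x)\left(dP_\nu(x) - d\overline{P}(x)\right)\right)^2,
\end{aligned}
$$

which is the desired statement of the theorem.

We now return to proving Lemma 4. Define the function $\underline{q} : \mathcal{Z} \to \mathbb{R}_+$ by $\underline{q}(z) := \inf_{x\in\mathcal X} q(z \mid x)$.
Since $\sum_z q(z \mid x) = 1$ for all $x$, we have

$$
\underline{q}(z) \le q(z \mid x) \le e^{\alpha}\underline{q}(z) \quad\text{and}\quad e^{-\alpha} \le \sum_z \underline{q}(z) \le 1.
$$

We can now define the probability mass function $m^0(z) := \underline{q}(z)/\sum_{z'}\underline{q}(z')$. By construction

$$
e^{-\alpha}m^0(z) = e^{-\alpha}\frac{\underline{q}(z)}{\sum_{z'}\underline{q}(z')} \le \underline{q}(z) \le q(z \mid x) \le e^{\alpha}\underline{q}(z) \le e^{\alpha}m^0(z),
$$

as claimed in equation (48).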

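To prove the bound (49), we note that $m(z \mid \nu) - \overline{m}(z) = \int_{\mathcal X} q(z \mid x)\left(dP_\nu(x) - d\overline{P}(x)\right)$ and
hence

$$
\begin{aligned}
|m(z\mid\nu) - \overline{m}(z)|
&= \left|\int_{\mathcal X}\left(q(z \mid x) - m^0(z)\right)\left(dP_\nu(x) - d\overline{P}(x)\right)\right| \\
&\le \int_{\mathcal X}\left|q(z \mid x) - m^0(z)\right|\,\left|dP_\nu(x) - d\overline{P}(x)\right| \\
&\le m^0(z)\sup_{x\in\mathcal X}\left|\frac{q(z \mid x)}{m^0(z)} - 1\right|\int_{\mathcal X}\left|dP_\nu(x) - d\overline{P}(x)\right| \\
&\le m^0(z)(e^\alpha - 1)\int_{\mathcal X}\left|dP_\nu(x) - d\overline{P}(x)\right| \le 2m^0(z)(e^\alpha - 1),
\end{aligned}
$$

where we used the fact that the total variation distance is bounded by 1.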
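
The construction of $m^0$ in the proof of Lemma 4 is easy to test on a concrete channel. The sketch below uses $k$-ary randomized response as an example of an $\alpha$-differentially private p.m.f.; this channel is chosen for illustration and is not taken from the paper. All function names are hypothetical. It builds $m^0$ from the pointwise infimum of $q(z \mid x)$ over $x$ and checks the two-sided bound of the lemma.

```python
import math

def randomized_response(k, alpha):
    """k-ary randomized response: q[x][z] = P(Z = z | X = x), an alpha-private channel."""
    denom = math.exp(alpha) + k - 1
    return [[(math.exp(alpha) if z == x else 1.0) / denom for z in range(k)]
            for x in range(k)]

def reference_pmf(q):
    """m0(z) = q_inf(z) / sum_z' q_inf(z'), where q_inf(z) = min over x of q(z | x)."""
    k = len(q[0])
    q_inf = [min(row[z] for row in q) for z in range(k)]
    total = sum(q_inf)
    return [v / total for v in q_inf]

def satisfies_sandwich(q, m0, alpha, tol=1e-12):
    """Check e^{-alpha} m0(z) <= q(z | x) <= e^{alpha} m0(z) for all x and z."""
    lo, hi = math.exp(-alpha), math.exp(alpha)
    return all(lo * m0[z] - tol <= row[z] <= hi * m0[z] + tol
               for row in q for z in range(len(m0)))

alpha = 0.4
q = randomized_response(4, alpha)
m0 = reference_pmf(q)
ok = satisfies_sandwich(q, m0, alpha)
```

For randomized response the infimum is the same for every output, so $m^0$ is uniform; less symmetric channels give a non-uniform $m^0$ through the same two functions.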
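
7.2 Proof of Corollary 4

The proof of this corollary is similar to that of Corollary 1. Indeed, as in the proof of Corollary 1
(recall Section 6.2), we may define $M_{\nu,i}(\cdot \mid z_{1:i-1})$ to be the distributions of the random variable
$Z_i$ conditioned on $Z_{1:i-1} = z_{1:i-1}$ and $V = \nu$. Similarly, let $\overline{M}_i(\cdot \mid z_{1:i-1})$ denote the average of
the $M_{\nu,i}$ over $\nu \in \mathcal{V}$. Applying the chain rule for KL divergences [23, Chapter 5.3], we obtain as in the
proof of Corollary 1

$$
I(Z_1,\ldots,Z_n;V) \le \sum_{i=1}^n\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}
\mathbb{E}\left[D_{\rm kl}\left(M_{\nu,i}(\cdot \mid Z_{1:i-1})\,\|\,\overline{M}_i(\cdot \mid Z_{1:i-1})\right)
 + D_{\rm kl}\left(\overline{M}_i(\cdot \mid Z_{1:i-1})\,\|\,M_{\nu,i}(\cdot \mid Z_{1:i-1})\right)\right].
$$

By construction of the $M_{\nu,i}$, the conditions of Theorem 2 hold. Define the linear functionals
$\varphi_{\nu,i}(\cdot \mid z_{1:i-1}) : L^\infty(\mathcal{X}) \to \mathbb{R}$ via

$$
\varphi_{\nu,i}(\gamma \mid z_{1:i-1}) := \int_{\mathcal X}\gamma(x)\left(dP_{\nu,i}(x \mid z_{1:i-1}) - d\overline{P}_i(x \mid z_{1:i-1})\right).
$$

By assumption, the samples $X_i$ are conditionally independent given $V = \nu$, so in this case we have
the equalities

$$
P_{\nu,i}(\cdot \mid Z_{1:i-1} = z_{1:i-1}) = P_{\nu,i}(\cdot) \quad\text{and}\quad \overline{P}_i(\cdot \mid Z_{1:i-1} = z_{1:i-1}) = \overline{P}_i(\cdot).
$$

Thus we find that $\varphi_{\nu,i}(\gamma \mid z_{1:i-1}) = \varphi_{\nu,i}(\gamma)$ for $\gamma \in L^\infty(\mathcal{X})$, where $\varphi_{\nu,i}$ is defined as in the statement
of the corollary. Applying Theorem 2, we thus find that

$$
\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\left[D_{\rm kl}\left(M_{\nu,i}(\cdot \mid Z_{1:i-1})\,\|\,\overline{M}_i(\cdot \mid Z_{1:i-1})\right)
 + D_{\rm kl}\left(\overline{M}_i(\cdot \mid Z_{1:i-1})\,\|\,M_{\nu,i}(\cdot \mid Z_{1:i-1})\right)\right]
\le C_\alpha\sup_{\gamma\in\mathcal{G}_\alpha}\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\left(\varphi_{\nu,i}(\gamma)\right)^2,
$$

where $C_\alpha$ is the constant defined in Theorem 2. Summing over $i = 1, \ldots, n$ completes the proof.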



8 Proofs for convex risk minimization

Finally, we turn to the proofs of our results on convex risk minimization.

8.1 Convex risk minimization and testing

To keep our presentation relatively self-contained, we begin with some preliminary results that are
useful for studying convex risk minimization, drawn in part from our earlier work. As in
the standard approach to minimax bounds (recall Section 2.1), we begin by reducing convex risk
minimization to a testing problem. Consider a collection of risk functionals $\{R_\nu\}_{\nu\in\mathcal V}$ indexed by
a packing set $\mathcal{V}$. For each $\nu \in \mathcal{V}$, we choose some representative $\theta^*_\nu \in \mathrm{argmin}_{\theta\in\Theta} R_\nu(\theta)$ of the set
of all minimizing vectors. Following Agarwal et al., we define a discrepancy measure between
pairs of risk functionals:

$$
\rho(R_\nu, R_{\nu'}) := \inf_{\theta\in\Theta}\left[R_\nu(\theta) + R_{\nu'}(\theta) - R_\nu(\theta^*_\nu) - R_{\nu'}(\theta^*_{\nu'})\right],
$$

and the $\rho$-separation of the set $\mathcal{V}$ is

$$
\rho^*(\mathcal{V}) := \min\left\{\rho(R_\nu, R_{\nu'}) : \nu, \nu' \in \mathcal{V},\ \nu \ne \nu'\right\}.
\qquad (50)
$$

When the set $\mathcal{V}$ is clear from context, we use $\rho^*$ as shorthand for this separation. The following
result is a variant of Lemma 2 from Agarwal et al.

Lemma 5. Let $P$ be a joint distribution over $X_1, \ldots, X_n$ and $V$ such that the $X_i$ are i.i.d. given $V$,
and such that

$$
\mathbb{E}_P\left[\ell(X, \theta) \mid V = \nu\right] = R_\nu(\theta).
$$

Let $M$ be the marginal distribution of the communicated private values $Z_1, \ldots, Z_n$. Then we have

$$
\mathbb{E}_{P,M}\left[\epsilon_n(M, \ell, \Theta, P)\right] \ge \frac{\rho^*(\mathcal V)}{2}\,\inf_{\psi}\,\mathbb{P}_{P,Q}\left(\psi(Z_1, \ldots, Z_n) \ne V\right),
$$

where the infimum is taken over all test functions $\psi : \mathcal{Z}^n \to \mathcal{V}$.

The proofs of our lower bounds on convex risk minimization exploit a combination of Lemma 5,
Theorem 2 and Fano's inequality. Each lower bound involves the following three steps:

(1) We begin by constructing a collection of loss functions satisfying the definition of our loss
class, then compute the minimal separation (50) so that we may apply Lemma 5.

(2) We provide an upper bound on the mutual information $I(Z_1, \ldots, Z_n; V)$ for our specific choice
of loss from step (1), which requires a careful packing construction to control the variational
bound of Theorem 2.

(3) The final step is to use the results of steps (1) and (2) in the application of Lemma 5 and Fano's
inequality (9).
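
The three steps above can be sketched numerically. The helper below encodes the standard form of Fano's inequality for a uniform random index $V$, and then multiplies the resulting error probability by half the separation. The factor $1/2$ is an assumption about the normalization used in the testing reduction, and the function names are hypothetical.

```python
import math

def fano_error_lower_bound(mutual_information, packing_size):
    """Fano's inequality: for V uniform on a set of size |V|, any test errs with
    probability at least 1 - (I(Z; V) + log 2) / log |V| (clipped at zero)."""
    if packing_size < 2:
        raise ValueError("packing must contain at least two elements")
    bound = 1.0 - (mutual_information + math.log(2.0)) / math.log(packing_size)
    return max(bound, 0.0)

def risk_lower_bound(separation, mutual_information, packing_size):
    """Separation / 2 times the Fano error bound (assumed normalization)."""
    return 0.5 * separation * fano_error_lower_bound(mutual_information, packing_size)

# Example: packing of size 16 with mutual information equal to (1/4) log 16.
info = 0.25 * math.log(16)
error_bound = fano_error_lower_bound(info, 16)
risk_bound = risk_lower_bound(1.0, info, 16)
```

The design point is that step (2) only needs to keep the mutual information a constant fraction below the log-cardinality of the packing; the separation from step (1) then sets the rate.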



8.2 Proof of Proposition 3

Our lower bound uses a packing of the $\ell_1$ ball to yield its results. Let $\mathcal{V} = \{\pm e_j\}_{j=1}^d$ be the
$2d$ standard basis vectors and their negations in $\mathbb{R}^d$. Fix some $\delta \in [0, 1/2]$, and consider the
sampling strategy that places all its mass on vectors $X \in \{-1, 1\}^d$, where for $\nu \in \mathcal{V}$ we have
$P_\nu(X = x) = (1 + \delta\nu^\top x)/2^d$. That is, conditional on $V = \nu$ (assuming w.l.o.g. that $\nu = \pm e_j$), the
coordinates of $X$ are independent uniform on $\{-1, 1\}$ except for the coordinate $j$, for which $X_j = 1$
with probability $(1 + \delta\nu_j)/2$ and $X_j = -1$ with probability $(1 - \delta\nu_j)/2$.

For this sampling strategy, we use the linear loss $\ell(x, \theta) = L\langle x, \theta\rangle$, which we also use in our
earlier paper [11]. The linear loss is $L$-Lipschitz continuous with respect to the $\ell_1$-norm for any
$x \in [-1, 1]^d$, and moreover gives $R_\nu(\theta) = L\delta\langle\nu, \theta\rangle$ with our sampling strategy. From Lemma 2 in
the paper [11], we obtain that with our choice of $\mathcal{V}$ and sampling,

$$
\rho^*(\mathcal{V}) = Lr\delta.
\qquad (51)
$$

We also have the following lemma, whose proof we provide in Appendix C.1.

Lemma 6. Under the conditions of the previous paragraph, let $\delta \le 1$ and $V$ be sampled uniformly
from $\{\pm e_j\}_{j=1}^d$. Then for any $\alpha$-differentially private channel $M$ with $\alpha \le 1/4$, we have

$$
I(Z_1, \ldots, Z_n; V) \le nC_\alpha\frac{\delta^2}{4d}\left(e^\alpha - e^{-\alpha}\right)^2,
$$

where $C_\alpha$ is as defined in Theorem 2.

Using Lemma 6, we can give an almost immediate proof of Proposition 3. Indeed, we have from
Fano's inequality (9), Lemmas 5 and 6, and the separation (51) that

$$
\epsilon^*_n(\mathcal{L}, \Theta, \alpha) \ge \frac{Lr\delta}{2}\left(1 - \frac{nC_\alpha(e^\alpha - e^{-\alpha})^2\delta^2/(4d) + \log 2}{\log(2d)}\right).
$$

So long as $d \ge 2$, setting

$$
\delta = \frac{\sqrt{d\log(2d)}}{\sqrt{nC_\alpha}\,(e^\alpha - e^{-\alpha})}
$$

and noting that $C_\alpha = O(1)$ and $e^\alpha - e^{-\alpha} \le 3\alpha$ for $\alpha \le 1/4$ completes the proof.

When $d = 1$, an argument via local packings and Le Cam's method (8) yields an identical result.
We sketch the proof here, though it is quite similar to the arguments used in Proposition 1. In this
case we use the packing set $\mathcal{V} = \{\pm 1\}$ and conditional on $V = \nu$, set $X = 1$ with probability
$(1 + \nu\delta)/2$ and $X = -1$ with probability $(1 - \nu\delta)/2$. The equality (51) still holds, and moreover, if
we define $M^n_\nu$ to be the marginal distribution of the samples $Z_{1:n}$ conditioned on $V = \nu$, we have

$$
\|M^n_1 - M^n_{-1}\|_{\rm TV}^2 \le \frac{1}{2}D_{\rm kl}\left(M^n_1\|M^n_{-1}\right) \le 2(e^\alpha - 1)^2 n\,\|P_1 - P_{-1}\|_{\rm TV}^2
$$

by Pinsker's inequality and Corollary 1. Here $P_\nu$ is the distribution of $X \mid V = \nu$. By construction,
the total variation $\|P_1 - P_{-1}\|_{\rm TV} = \delta$, whence we find that $\|M^n_1 - M^n_{-1}\|_{\rm TV}^2 \le 2(e^\alpha - 1)^2 n\delta^2$.
Applying Le Cam's method (8) and Lemma 5, we obtain

$$
\epsilon^*_n(\mathcal{L}, \Theta, \alpha) \ge \frac{Lr\delta}{2}\left(\frac{1}{2} - \frac{\sqrt{n}(e^\alpha - 1)\delta}{\sqrt{2}}\right).
$$

Take $\delta = (2\sqrt{2n}(e^\alpha - 1))^{-1}$ to complete the proof in this case.

As a minor remark, if in either of the above two cases our choice of $\delta$ would yield $\delta > 1/2$
because $d$ is too large or $\alpha^2 n$ is too small, we take $\delta = 1/2$ to obtain the desired bound.
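
The choice of $\delta$ in the one-dimensional case can be verified with a few lines of arithmetic. The sketch below assumes the Le Cam bound has the form $(Lr\delta/2)\,(1/2 - \sqrt{n}(e^\alpha - 1)\delta/\sqrt{2})$, as reconstructed from the argument above, and includes the cap $\delta \le 1/2$ from the closing remark. The function names are hypothetical.

```python
import math

def le_cam_bracket(delta, n, alpha):
    """The factor 1/2 - sqrt(n) (e^alpha - 1) delta / sqrt(2)."""
    return 0.5 - math.sqrt(n) * math.expm1(alpha) * delta / math.sqrt(2.0)

def delta_choice(n, alpha):
    """delta = 1 / (2 sqrt(2 n) (e^alpha - 1)), capped at 1/2."""
    return min(0.5, 1.0 / (2.0 * math.sqrt(2.0 * n) * math.expm1(alpha)))

def lower_bound(L, r, n, alpha):
    """(L r delta / 2) times the bracket, evaluated at the chosen delta."""
    delta = delta_choice(n, alpha)
    return 0.5 * L * r * delta * le_cam_bracket(delta, n, alpha)

bracket_uncapped = le_cam_bracket(delta_choice(100, 0.2), 100, 0.2)
bracket_capped = le_cam_bracket(delta_choice(1, 0.1), 1, 0.1)
```

With the uncapped choice the bracket equals $1/4$ exactly, so the bound scales as $Lr/(\sqrt{n}(e^\alpha - 1))$; when the cap is active the bracket is only larger.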



8.3 Proof of Proposition 4

The proof of this proposition follows the outline established in Section 8.1, as did the previous
proposition. We begin with two auxiliary lemmas with proofs deferred to the appendices. Our first
result concerns a packing of the Boolean hypercube:

Lemma 7. There exists a packing $\mathcal{V}$ of the $d$-dimensional hypercube $\{-1, 1\}^d$ with $\|\nu - \nu'\|_1 \ge d/2$
for each $\nu, \nu' \in \mathcal{V}$ with $\nu \ne \nu'$ such that the cardinality of $\mathcal{V}$ is at least $\lceil\exp(d/16)\rceil$ and

$$
\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\nu\nu^\top \preceq 25\,I_{d\times d}.
$$

See Appendix B.2 for the proof.

Using Lemma 7, we bound the mutual information between samples $Z$ from a particular distribution
and a random sample $V$ from a set $\mathcal{V}$ of the form in the lemma. Indeed, let $\mathcal{V}$ be a packing
of the $d$-dimensional hypercube specified in Lemma 7. Conditional on $V = \nu \in \{-1, 1\}^d$, let us
sample the random vector $X \in \{-1, 0, 1\}^d$ according to the following scheme, where $\delta \in [0, 1/2]$
will be chosen later:

$$
X = \begin{cases} e_j & \text{w.p. } \cdots \\ -e_j & \text{w.p. } \cdots \end{cases}
\qquad (52)
$$

We have the following lemma, which applies so long as the channel $Q$ is $\alpha$-locally private (1).

Lemma 8. Let $Z_i$ be $\alpha$-locally differentially private for $X_i$, and let $X$ be sampled according to the
distribution (52) conditional on $V = \nu$. Then

$$
I(Z_1, \ldots, Z_n; V) \le nC_\alpha\frac{25}{16}\cdot\frac{\delta^2}{d}\left(e^\alpha - e^{-\alpha}\right)^2,
$$

where $C_\alpha$ is defined as in Theorem 2.

See Appendix C.2 for the proof.

We use the hinge loss $\ell(x, \theta) = L[r - \langle x, \theta\rangle]_+$ as our loss function. In this case, it is clear that
our sampling strategy yields that the loss $\ell(x, \theta)$ is uniformly $(L, \infty)$-Lipschitz, since $\|x\|_1 \le 1$.
Moreover, we have the discrepancy bound (see [11, Lemma 3])

$$
\rho^*(\mathcal{V}) \ge \frac{rL\delta}{2}.
$$

Consequently, by applying Lemma 8 and Fano's inequality (9) to Lemma 5, we obtain

$$
\epsilon^*_n(\mathcal{L}, \Theta, \alpha) \ge \frac{rL\delta}{4}\left(1 - \frac{25nC_\alpha(e^\alpha - e^{-\alpha})^2\delta^2/(16d) + \log 2}{d/16}\right).
$$

So long as $d \ge 12$, we have $16\log 2/d < 15/16$. Thus choosing

$$
\delta = \frac{d}{29\sqrt{nC_\alpha}\,(e^\alpha - e^{-\alpha})}
$$

and noting that $e^\alpha - e^{-\alpha} \le 3\alpha$ and $C_\alpha = O(1)$ for $\alpha \le 1/4$ completes the proof in this case.

As in the proof of Proposition 3, when $d \le 11$, we apply an essentially similar argument but
with Le Cam's method (8), which gives the desired result. (Indeed, the proof of the case $d = 1$
from Proposition 3 applies here as well.)



9 Conclusions 

We have developed two inequalities, Theorems 1 and 2, and their corollaries, which allow
us to give sharp minimax rates for estimation in locally private settings. It is possible to use
our techniques to derive many other results on the convergence of different estimation procedures; 
indeed, in a forthcoming companion paper to this one, we show how our results extend to probability 
estimation problems, including nonparametric density estimation. 






We believe that our results provide insight into the costs of attaining privacy. In particular, the 
results here show the price that must be paid — in the form of increased sample complexity — when 
providers of the data wish to guarantee their own privacy before any data release. This type of 
guarantee, while certainly desirable, may be untenable for problems in which samples are expensive 
to obtain, sample sizes n are small, or for very high dimensional problems. In quantifying these 
tradeoffs, we hope that our sharp minimax bounds lead to actionable procedures and inform the 
discussion of disclosure risk.

A Effects of differential privacy in non-compact spaces

In this appendix, we present a somewhat pathological example that demonstrates the effects of
differential privacy in non-compact spaces. Let us assume only that $\theta \in \mathbb{R}$ and $\alpha < \infty$, and we
denote by $\mathcal{P}_\theta$ the collection of probability measures with variance 1 having $\theta$ as a mean. In
contrast to the non-private case, where the risk of the sample mean scales as $1/n$, we obtain

$$
\mathfrak{M}_n\left(\mathbb{R}, (\cdot)^2, \alpha\right) = \infty
\qquad (53)
$$

for all $n \in \mathbb{N}$. To see this, consider the Fano inequality version (9). Fix $\delta > 0$ and choose
$\{\theta_1 = 0, \theta_2 = 2\delta, \ldots, \theta_N = 2N\delta\}$ where $N = \max\{\lceil\exp(64(e^\alpha - 1)^2 n)\rceil, 2^4\}$. We then
apply Corollary 1 with $\mathcal{V} = [N]$. Since $\|P_\nu - P_{\nu'}\|_{\rm TV} \le 1$ for any distributions $P_\nu$ and $P_{\nu'}$, this implies

$$
\mathfrak{M}_n\left(\mathbb{R}, (\cdot)^2, \alpha\right) \ge \delta^2\left(1 - \frac{16(e^\alpha - 1)^2 n + \log 2}{\log N}\right)
\ge \delta^2\left(1 - \frac{1}{4} - \frac{1}{4}\right) = \frac{1}{2}\delta^2.
$$

Since $\delta > 0$ was arbitrary, this proves the infinite minimax risk bound (53). The construction to
achieve (53) is somewhat contrived, but it suggests that care is needed when designing differentially
private inference procedures, and shows that even in cases when it is possible to attain a parametric
rate of convergence, there may be no (locally) differentially private inference procedure.



B Packing set constructions

In this appendix, we collect proofs of the constructions of our packing sets.

B.1 Proof of Lemma 3

The first statement of the lemma follows from an application of the probabilistic method. Consider
the event $\mathcal{E}$ that there exists a collection of vectors $\nu^1, \ldots, \nu^N$ such that $(1/N)\sum_{i=1}^N \nu^i(\nu^i)^\top \preceq
((1 + \delta)/d)\,I_{d\times d}$. By following the proof of Duchi et al. [11], the event $\mathcal{E}$ holds whenever
$N$, $d$, and $\delta$ satisfy the inequality given as equation (23) in the paper [11].
For $d > 16$, choosing $\delta = 1$ and $N = \lceil\exp(49d/256)\rceil$ yields
the desired inequality. For $d \le 16$, this inequality fails, but a simpler argument gives the result.
The choice of $\mathcal{V} = \{u/\|u\|_2 : u \in \{-1, 1\}^d\}$ yields $|\mathcal{V}| = \exp(d\log 2)$, and by inspection

$$
\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\nu\nu^\top = (1/d)\,I_{d\times d}, \quad\text{and}\quad
\|\nu - \nu'\|_2 = \frac{1}{\sqrt{d}}\|u - u'\|_2 \ge \frac{2}{\sqrt{d}}
$$

for $u \ne u' \in \{-1, 1\}^d$. Combining the pieces yields the claim.
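
For small $d$ the normalized hypercube can be enumerated directly, which gives a concrete check of the two properties claimed "by inspection". The sketch below is only an illustration for one small dimension; the function names are hypothetical and it uses nothing beyond the Python standard library.

```python
import itertools
import math

def normalized_hypercube(d):
    """All 2^d vectors u / ||u||_2 with u in {-1, 1}^d."""
    scale = 1.0 / math.sqrt(d)
    return [tuple(s * scale for s in u) for u in itertools.product((-1, 1), repeat=d)]

def second_moment(vectors):
    """Average outer product (1/|V|) sum_v v v^T, returned as a list of rows."""
    d = len(vectors[0])
    n = len(vectors)
    return [[sum(v[i] * v[j] for v in vectors) / n for j in range(d)] for i in range(d)]

def min_pairwise_distance(vectors):
    """Smallest Euclidean distance between two distinct vectors."""
    return min(math.dist(u, v) for u, v in itertools.combinations(vectors, 2))

d = 4
packing = normalized_hypercube(d)
moment = second_moment(packing)
closest = min_pairwise_distance(packing)
```

The average outer product comes out as the identity divided by $d$, and the closest pair differs in a single coordinate, at distance $2/\sqrt{d}$.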

B.2 Proof of Lemma 7

We again use the probabilistic method. Consider a set of $N$ vectors $\nu^i \in \{-1, 1\}^d$ sampled uniformly
at random from the Boolean hypercube, and for a fixed $t > 0$, define the two "bad" events

$$
B_1 := \left\{\exists\, i \ne j : \|\nu^i - \nu^j\|_1 < d/2\right\}, \quad\text{and}\quad
B_2(t) := \left\{\frac{1}{N}\sum_{i=1}^N \nu^i(\nu^i)^\top \not\preceq (t + 1)\,I_{d\times d}\right\}.
$$

We begin by analyzing $B_1$. Letting $\{W_i\}_{i=1}^d$ denote a sequence of i.i.d. Bernoulli $\{0, 1\}$ variables, for
any $i \ne j$, the event $\{\|\nu^i - \nu^j\|_1 < d/2\}$ is equivalent to the event $\{\sum_{i=1}^d W_i < d/4\}$. Consequently,
by combining the union bound with the Hoeffding bound, we find that

$$
\mathbb{P}(B_1) \le \binom{N}{2}\,\mathbb{P}\left(\|\nu^i - \nu^j\|_1 < d/2\right) \le \binom{N}{2}\exp(-d/8).
\qquad (54)
$$

Turning to the event $B_2(t)$, we have $\frac{1}{N}\sum_{i=1}^N \nu^i(\nu^i)^\top \not\preceq (t + 1)\,I_{d\times d}$ if and only if the maximum
eigenvalue $\lambda_{\max}\left(\frac{1}{N}\sum_{i=1}^N \nu^i(\nu^i)^\top - I_{d\times d}\right)$ is larger than $t$. Using sharp versions of the Ahlswede-
Winter inequalities [2] (see Corollary 4.2 in the paper [3]), we obtain

$$
\mathbb{P}(B_2(t)) \le d\exp(-\cdots).
\qquad (55)
$$

Finally, combining the union bound with inequalities (54) and (55), we find that

$$
\mathbb{P}(B_1 \cup B_2(t)) \le \binom{N}{2}\exp(-d/8) + d\exp(-\cdots).
$$

By inspection, if we choose $t = 24$ and $N = \lceil\exp(d/16)\rceil$, the above bound is strictly less than 1,
so a packing satisfying the constraints must exist.
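
The existence argument is non-constructive, but for moderate $d$ a packing with the required $\ell_1$ separation is easy to find by random sampling with rejection. The sketch below is a stand-in for the probabilistic method rather than the proof's argument: it enforces only the distance condition, not the bound on the average outer product, and all names are hypothetical.

```python
import math
import random

def l1_distance(u, v):
    """l1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))

def greedy_hypercube_packing(d, target_size, rng, max_draws=100000):
    """Draw uniform points of {-1, 1}^d and keep those at l1-distance >= d/2
    from every point kept so far, until target_size points are collected."""
    packing = []
    for _ in range(max_draws):
        if len(packing) >= target_size:
            break
        candidate = tuple(rng.choice((-1, 1)) for _ in range(d))
        if all(l1_distance(candidate, v) >= d / 2 for v in packing):
            packing.append(candidate)
    return packing

def is_packing(vectors, d):
    """True if every pair of distinct vectors is at l1-distance at least d/2."""
    return all(l1_distance(u, v) >= d / 2
               for i, u in enumerate(vectors) for v in vectors[i + 1:])

d = 32
target = math.ceil(math.exp(d / 16))  # the cardinality promised by the lemma
packing = greedy_hypercube_packing(d, target, random.Random(0))
```

Two independent uniform points differ in about half of their coordinates, so their $\ell_1$ distance concentrates around $d$; this is the same concentration that the Hoeffding step quantifies.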
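
C Proofs of lemmas for convex risk minimization

In this appendix, we collect the proofs of various lemmas associated with convex risk minimization.

C.1 Proof of Lemma 6

We use the notation of Theorem 2, recalling the linear functionals $\varphi_\nu : L^\infty(\mathcal{X}) \to \mathbb{R}$. Because the
set $\mathcal{X} = \{-1, 1\}^d$, we can identify vectors $\gamma \in L^\infty(\mathcal{X})$ with vectors $\gamma \in \mathbb{R}^{2^d}$. Moreover, we have (by
construction) that

$$
\varphi_\nu(\gamma) = \sum_{x\in\{-1,1\}^d}\gamma(x)\left(p_\nu(x) - \overline{p}(x)\right) = \frac{\delta}{2^d}\sum_{x\in\{-1,1\}^d}\gamma(x)\,\nu^\top x.
$$

For each $\nu \in \mathcal{V}$, we may construct a vector $u_\nu \in \{-1, 1\}^{2^d}$, indexed by $x \in \{-1, 1\}^d$, with

$$
u_\nu(x) = \nu^\top x = \begin{cases} 1 & \text{if } \nu = \pm e_j \text{ and } \mathrm{sign}(\nu_j) = \mathrm{sign}(x_j) \\
-1 & \text{if } \nu = \pm e_j \text{ and } \mathrm{sign}(\nu_j) \ne \mathrm{sign}(x_j). \end{cases}
$$

For $\nu = e_j$, we see that $u_{e_1}, \ldots, u_{e_d}$ are the first $d$ columns of the standard Hadamard transform
matrix (and $u_{-e_j}$ are their negatives). Then we have that $\sum_x \gamma(x)\,\nu^\top x = \gamma^\top u_\nu$.
Note also that $u_\nu u_\nu^\top = u_{-\nu}u_{-\nu}^\top$, and as a consequence we have

$$
\sum_{\nu\in\mathcal V}\varphi_\nu(\gamma)^2 = \frac{\delta^2}{4^d}\,\gamma^\top\left(\sum_{\nu\in\mathcal V}u_\nu u_\nu^\top\right)\gamma
= \frac{2\delta^2}{4^d}\,\gamma^\top\left(\sum_{j=1}^d u_{e_j}u_{e_j}^\top\right)\gamma.
\qquad (56)
$$

But now, studying the quadratic form (56), we note that the vectors $u_{e_j}$ are orthogonal. As
a consequence, the vectors $u_{e_j}$ (up to scaling) are the only eigenvectors corresponding to positive
eigenvalues of the positive semidefinite matrix $\sum_{j=1}^d u_{e_j}u_{e_j}^\top$. Thus, since the set

$$
\mathcal{G}_\alpha = \left\{\gamma \in \mathbb{R}^{2^d} : \|\gamma\|_\infty \le (e^\alpha - e^{-\alpha})/2\right\}
\subset \left\{\gamma \in \mathbb{R}^{2^d} : \|\gamma\|_2^2 \le 2^{d-2}(e^\alpha - e^{-\alpha})^2\right\},
$$

we have via an eigenvalue calculation that

$$
\sup_{\gamma\in\mathcal{G}_\alpha}\sum_{\nu\in\mathcal V}\varphi_\nu(\gamma)^2
\le \frac{2\delta^2}{4^d}\cdot 2^d\cdot 2^{d-2}\left(e^\alpha - e^{-\alpha}\right)^2
= \frac{\delta^2}{2}\left(e^\alpha - e^{-\alpha}\right)^2,
$$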

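since $\|u_{e_j}\|_2^2 = 2^d$ for each $j$. Applying Theorem 2 and Corollary 4 completes the proof.

C.2 Proof of Lemma 8

Our strategy is to apply Theorem 2 to bound the mutual information. We note that since the set
$\mathcal{X} = \{\pm e_j\}_{j=1}^d$, under the notation of Theorem 2, we may identify vectors $\gamma \in L^\infty(\mathcal{X})$ by vectors
$\gamma \in \mathbb{R}^{2d}$. Moreover, if we define $\overline{\nu} = \frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\nu$ to be the mean element of the packing set, then
the linear functional $\varphi_\nu$ defined in Theorem 2 is given by

$$
\varphi_\nu(\gamma) = \frac{\delta}{4d}\,\gamma^\top\begin{bmatrix} I \\ -I \end{bmatrix}(\nu - \overline{\nu}).
$$

In particular, we have that

$$
\begin{aligned}
\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\varphi_\nu(\gamma)^2
&= \frac{\delta^2}{(4d)^2}\,\gamma^\top\begin{bmatrix} I \\ -I \end{bmatrix}
 \left(\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}(\nu - \overline{\nu})(\nu - \overline{\nu})^\top\right)
 \begin{bmatrix} I & -I \end{bmatrix}\gamma \\
&\le \left(\frac{5\delta}{4d}\right)^2\gamma^\top\begin{bmatrix} I \\ -I \end{bmatrix}\begin{bmatrix} I & -I \end{bmatrix}\gamma.
\end{aligned}
\qquad (57)
$$

Here the final inequality used our assumption on the sum of outer products in $\mathcal{V}$.

We complete our proof using the bound (57). We note that the orthogonal collection of eigenvectors
of the matrix specified in (57) are vectors of the form $[e_j^\top\ e_j^\top]^\top \in \mathbb{R}^{2d}$, with eigenvalue 0,
and $[e_j^\top\ {-e_j^\top}]^\top \in \mathbb{R}^{2d}$, with eigenvalue 2. As a consequence, since we have the containment

$$
\left\{\gamma \in \mathbb{R}^{2d} : \|\gamma\|_\infty \le (e^\alpha - e^{-\alpha})/2\right\}
\subset \left\{\gamma \in \mathbb{R}^{2d} : \|\gamma\|_2^2 \le d(e^\alpha - e^{-\alpha})^2/2\right\},
$$

we have the inequality

$$
\sup_{\gamma\in\mathcal{G}_\alpha}\frac{1}{|\mathcal V|}\sum_{\nu\in\mathcal V}\varphi_\nu(\gamma)^2
\le \frac{25\delta^2}{16d^2}\cdot\frac{2d(e^\alpha - e^{-\alpha})^2}{2}
= \frac{25\delta^2}{16d}\left(e^\alpha - e^{-\alpha}\right)^2.
$$

Applying Theorem 2 completes the proof.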



D Achievability by stochastic mirror descent

In this appendix, we provide further details on the stochastic mirror descent algorithm used to
achieve the upper bounds in Propositions 3 and 4.

D.1 Convergence guarantees

We begin by reviewing known convergence guarantees for the stochastic mirror descent algorithm (37).
The important consequences for our analysis are the following convergence rate
guarantees, which rely on the average vector $\widehat\theta_n := \frac{1}{n}\sum_{t=1}^n\theta^t$. First, if we have the bound
$\mathbb{E}[\|g_t\|_\infty^2] \le L_\infty^2$ and the containment $\Theta \subset \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le r_1\}$, then by choosing the proximal
function $\psi(u) = \frac{1}{2(p-1)}\|u\|_p^2$ with $p = 1 + 1/\log d$, the update (37) attains convergence rate

$$
\mathbb{E}[R(\widehat\theta_n)] - R(\theta^*) \le c\,\frac{L_\infty r_1\sqrt{\log d}}{\sqrt{n}}
\qquad (58a)
$$

for some universal constant $c$. When $\mathbb{E}[\|g_t\|_2^2] \le L_2^2$ and $\Theta \subset \{\theta \in \mathbb{R}^d : \|\theta\|_2 \le r_2\}$, the standard
(Euclidean) choice $\psi(\theta) = \frac{1}{2}\|\theta\|_2^2$, which yields stochastic gradient descent in the update (37),
provides the convergence guarantee

$$
\mathbb{E}[R(\widehat\theta_n)] - R(\theta^*) \le c\,\frac{L_2 r_2}{\sqrt{n}}.
\qquad (58b)
$$

For proofs of the results (58a) or (58b), see for example Beck and Teboulle (Section 5) or
Nemirovski et al. [30, Sections 2.2-2.3].



D.2 Achievability for Proposition 3

Recall the family of loss functions $\mathcal{L}(\mathbb{B}_1(r); L)$: by definition, any loss $\ell \in \mathcal{L}(\mathbb{B}_1(r); L)$ satisfies the
bound $\|\partial\ell(x, \theta)\|_\infty \le L$. In this case, Duchi et al. show that the sampling strategy (38b),
if we choose $M = c\sqrt{d}L/\alpha$ for a (universal) constant $c$, yields $\mathbb{E}[Z_t \mid g_t] = g_t$, and moreover, we
see that by inspection $\|Z_t\|_\infty^2 = c^2 dL^2/\alpha^2$. Combined with the convergence guarantee (58a), this
shows that Proposition 3 is sharp. (See also the upper bound in Corollary 1 of the paper [11].)



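D.3 Proof of Proposition 5

It suffices to compute the expectation of a random variable $Z$ sampled according to the strategy (38a),
after which we may directly apply the convergence guarantee (58b). With that in mind,
we compute $\mathbb{E}[Z \mid g]$ for a vector $g \in \mathbb{R}^d$. By scaling, it is no loss of generality to assume that $L = 1$
and $\|g\|_2 = 1$, and using the rotational symmetry of the $\ell_2$-ball, we see it is no loss of generality to
assume that $g = e_1$, the first standard basis vector.

Let the function $s_d(\cdot)$ denote the surface area of the sphere in $\mathbb{R}^d$, so that

$$
s_d(r) = \frac{d\pi^{d/2}}{\Gamma(d/2 + 1)}\,r^{d-1}
$$

is the surface area of the sphere of radius $r$. (We use $s_d$ as a shorthand for $s_d(1)$ when convenient.)
Then for a random variable $W$ sampled uniformly from the half of the $\ell_2$-ball with first coordinate
$W_1 \ge 0$, symmetry implies that by integrating over the radii of the ball,

$$
\mathbb{E}[W] = e_1\frac{1}{s_d}\int_0^1 s_{d-1}\left(\sqrt{1 - r^2}\right)r\,dr.
$$

Making the change of variables to spherical coordinates (we use $\phi$ as the angle), we have

$$
\frac{1}{s_d}\int_0^1 s_{d-1}\left(\sqrt{1 - r^2}\right)r\,dr
= \frac{1}{s_d}\int_0^{\pi/2}s_{d-1}(\cos\phi)\sin\phi\,d\phi
= \frac{s_{d-1}}{s_d}\int_0^{\pi/2}\cos^{d-2}(\phi)\sin(\phi)\,d\phi.
$$

Noting that $\frac{d}{d\phi}\cos^{d-1}(\phi) = -(d - 1)\cos^{d-2}(\phi)\sin(\phi)$, we obtain

$$
\frac{s_{d-1}}{s_d}\int_0^{\pi/2}\cos^{d-2}(\phi)\sin(\phi)\,d\phi
= -\frac{s_{d-1}}{s_d}\cdot\frac{\cos^{d-1}(\phi)}{d - 1}\bigg|_0^{\pi/2}
= \frac{s_{d-1}}{(d - 1)\,s_d},
$$

or that

$$
\mathbb{E}[W] = e_1\frac{(d - 1)\pi^{\frac{d-1}{2}}\,\Gamma(\frac{d}{2} + 1)}{(d - 1)\,d\pi^{\frac{d}{2}}\,\Gamma(\frac{d-1}{2} + 1)}
= e_1\underbrace{\frac{1}{d\sqrt{\pi}}\cdot\frac{\Gamma(\frac{d}{2} + 1)}{\Gamma(\frac{d-1}{2} + 1)}}_{=:\,c_d},
\qquad (59)
$$

where we define the constant $c_d$ to be the final ratio.

With the expression (59), we see that for our sampling strategy for $Z$, we have

$$
\mathbb{E}[Z \mid g] = g\,Bc_d\left(\frac{e^\alpha}{e^\alpha + 1} - \frac{1}{e^\alpha + 1}\right) = g\,Bc_d\,\frac{e^\alpha - 1}{e^\alpha + 1}.
$$

Consequently, the choice

$$
B = \frac{e^\alpha + 1}{e^\alpha - 1}\cdot\frac{L}{c_d}
= \frac{e^\alpha + 1}{e^\alpha - 1}\cdot\frac{L\sqrt{\pi}\,d\,\Gamma(\frac{d-1}{2} + 1)}{\Gamma(\frac{d}{2} + 1)}
$$

yields $\mathbb{E}[Z \mid g] = g$. Moreover, we have

$$
\|Z\|_2 = B \le c\,L\,\frac{e^\alpha + 1}{e^\alpha - 1}\sqrt{d}
$$

for a numerical constant $c$, by Stirling's approximation to the $\Gamma$-function. By noting that $(e^\alpha + 1)/(e^\alpha - 1) \le 3/\alpha$ for $\alpha \le 1$,
we see that $\|Z\|_2 \le 3cL\sqrt{d}/\alpha$.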
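
Both ingredients of the norm bound can be checked numerically. The sketch below assumes the form $B = \frac{e^\alpha + 1}{e^\alpha - 1}\cdot L\sqrt{\pi}\,d\,\Gamma(\frac{d-1}{2} + 1)/\Gamma(\frac{d}{2} + 1)$ read off from the derivation above. It evaluates the Gamma ratio through `math.lgamma` to avoid overflow, and the function names are hypothetical.

```python
import math

def privacy_factor(alpha):
    """(e^alpha + 1) / (e^alpha - 1)."""
    return (math.exp(alpha) + 1.0) / math.expm1(alpha)

def dimension_factor(d):
    """sqrt(pi) * d * Gamma((d - 1)/2 + 1) / Gamma(d/2 + 1), via log-gamma."""
    log_ratio = math.lgamma((d - 1) / 2.0 + 1.0) - math.lgamma(d / 2.0 + 1.0)
    return math.sqrt(math.pi) * d * math.exp(log_ratio)

def norm_bound(alpha, d, L=1.0):
    """B = privacy_factor(alpha) * L * dimension_factor(d) (assumed form)."""
    return privacy_factor(alpha) * L * dimension_factor(d)

alphas = [k / 1000.0 for k in range(1, 1001)]          # grid on (0, 1]
privacy_ok = all(privacy_factor(a) <= 3.0 / a for a in alphas)
largest_ratio = max(dimension_factor(d) / math.sqrt(d) for d in range(1, 2001))
```

On this grid the privacy factor stays below $3/\alpha$, and the dimension factor divided by $\sqrt{d}$ stays below $\sqrt{2\pi}$, so $B$ grows as $L\sqrt{d}/\alpha$ up to a numerical constant.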

To complete the proof, we make a few more remarks. If $\ell$ is $L$-Lipschitz with respect to the
$\ell_p$-norm for $p \in [2, \infty]$, it is Lipschitz with respect to the $\ell_2$-norm since $\|g\|_2 \le \|g\|_q \le L$ for the
$q \le 2$ conjugate to $p$, that is, $1/p + 1/q = 1$. As a consequence, by applying the convergence
guarantee (58b) with our sampling scheme for the unbiased gradient vectors $Z_t$, we obtain

$$
\mathbb{E}[R(\widehat\theta_n)] - R(\theta^*) \le c\,\frac{Lr_2\sqrt{d}}{\alpha\sqrt{n}}
$$

for a universal constant $c$, which is our desired result.

E Proof of Lemma 2

For any $x, y > 0$, the concavity of the logarithm implies that

$$
\log(y) \le \log(x) + \frac{y - x}{x}.
$$

Setting $x = 1$ and $y = (a + b)/(a + c)$, we find the inequality

$$
\log\frac{a + b}{a + c} \le \frac{a + b}{a + c} - 1 = \frac{b - c}{a + c}.
$$

On the other hand, setting $x = 1$ and $y = (a + c)/(a + b)$, we find the inequality

$$
\log\frac{a + c}{a + b} \le \frac{a + c}{a + b} - 1 = \frac{c - b}{a + b}.
$$

Using the first inequality for $a + b \ge a + c$ and the second for $a + b < a + c$ completes the proof.
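
Combining the two cases gives $|\log\frac{a+b}{a+c}| \le \frac{|b - c|}{a + \min\{b, c\}}$ whenever $a + \min\{b, c\} > 0$. That this combined form coincides with the exact statement of Lemma 2 is an assumption, since the statement itself is not repeated here. The sketch below tests the combined inequality on random inputs; the function name and the sampling ranges are illustrative choices.

```python
import math
import random

def log_ratio_bound_holds(a, b, c, tol=1e-12):
    """Check |log((a + b)/(a + c))| <= |b - c| / (a + min(b, c))."""
    lhs = abs(math.log((a + b) / (a + c)))
    rhs = abs(b - c) / (a + min(b, c))
    return lhs <= rhs + tol

rng = random.Random(1)
trials = [(rng.uniform(0.1, 5.0), rng.uniform(0.0, 5.0), rng.uniform(0.0, 5.0))
          for _ in range(1000)]
all_hold = all(log_ratio_bound_holds(a, b, c) for a, b, c in trials)
```

Both cases follow from $\log y \le y - 1$, so the check cannot fail for admissible inputs beyond floating-point error, which the tolerance absorbs.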